Data science project for ANLY-501 | A data science tale about a coffee company.

This notebook is the project report for the data science project done as part of ANLY-501. The project intends to show all stages of the data science pipeline; this notebook facilitates that by organizing the different phases into different sections and keeping the code and visualizations in one place. To run this notebook, install the Jupyter notebook software available from http://jupyter.org/. All code for this project is available on Github at https://github.com/aarora79/sb_study and there is also a simple website associated with it at https://aarora79.github.io/sb_study/. The entire code (including the generated output) can be downloaded as a zip file from Github via this URL: https://github.com/aarora79/sb_study/archive/master.zip. To run the code, simply unzip master.zip and run "python main.py". The code requires Python 3.5.2|Anaconda 4.1.1 to be installed on the machine. Also, since the code fetches the datasets in real time, an Internet connection is required.

[Updated for Project3 11/30/2016, starting from section Predictions based on classification].

Data Science Problem

Starbucks Corporation is an American coffee company and coffeehouse chain with more than 24,000 stores across the world. This project intends to explore the following data science problems:

  1. Exploratory Data Analysis (EDA) of Starbucks store locations, for example the geographical distribution of stores by country, region, ownership model, brand name etc.
  2. Find relationships between the Starbucks data and various economic and human development indices such as GDP, ease of doing business, rural to urban population ratio, literacy rate, revenue from tourist inflow and so on.
  3. Predict which countries where Starbucks does not have a store today are most suitable for Starbucks stores (in other words, in which country where Starbucks does not have a presence should Starbucks open its next store, and how many).

This problem is important because it attempts to provide a model (using Starbucks as an example) which can be applied to any similar business (say in the food and hospitality industry) to predict where it could expand globally. It also provides insight into where the business is located today, for example the fact that the total number of stores in the entire continent of Africa (48) is far less than the number of stores at U.S. airports alone (184).

Potential Analysis that Can Be Conducted Using Collected Data

The data used as part of this project is obtained from two sources.

  1. Starbucks store location data is available from https://opendata.socrata.com via an API. The Socrata Open Data API (SODA) end point for this data is https://opendata.socrata.com/resource/xy4y-c4mk.json

  2. The economic and urban development data for various countries is available from the World Bank (WB) website. WB APIs are described at https://datahelpdesk.worldbank.org/knowledgebase/topics/125589-developer-information. The API end point for all the World Development Indicators (WDI) is http://api.worldbank.org/indicators?format=json. Some examples of indicators being collected include GDP, urban to rural population ratio, % of employed youth (15-24 years), international tourism receipts (as % of total exports), ease of doing business and so on.
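The project's actual download logic lives in the sb and wb modules; the snippet below is only a minimal sketch, using the standard requests library, of how these two endpoints could be queried. The $limit and per_page paging parameters and the NY.GDP.MKTP.CD indicator code are illustrative choices, not necessarily what the project uses.

import requests

# Minimal sketch of querying the two endpoints; the project's real download code is in sb/sb.py and wb/wb.py.
SODA_URL = 'https://opendata.socrata.com/resource/xy4y-c4mk.json'
stores = requests.get(SODA_URL, params={'$limit': 1000}).json()  # SODA returns a list of JSON records
print('fetched %d Starbucks store records' % len(stores))

# The World Bank API returns a [metadata, data] pair for an indicator query; NY.GDP.MKTP.CD
# (GDP, current US$) is used here purely as an example indicator.
WB_URL = 'http://api.worldbank.org/countries/all/indicators/NY.GDP.MKTP.CD'
meta, gdp = requests.get(WB_URL, params={'format': 'json', 'date': '2015', 'per_page': 500}).json()
print('fetched GDP values for %d country entries' % len(gdp))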

The possible directions/hypotheses based on the collected data are listed below (this is not a complete list; it will be expanded as the project goes on):

  1. EDA about Starbucks store locations (for example):

    • What percentage of stores exist in high income, high literacy, high urban to rural population ratio European countries vs., say, high population, rising GDP, low urban to rural population ratio Asian countries.
    • Distribution of stores across geographies based on type of ownership (franchisee, joint venture etc.), brand name etc.
    • Which country and which city have the most Starbucks stores per 1,000 people.
    • Is there always a Starbucks open at any UTC time during a 24-hour period, i.e. can you always find some Starbucks store open at any given time somewhere in some timezone around the world?
  2. Data visualization of the Starbucks store data:

    • World map showing Starbucks locations around the world.
    • Heat map of the world based on the number of Starbucks stores in a country.
    • Frequency distribution of Starbucks stores by city in a given country. Does this distribution resemble any well-known statistical distribution?
    • Parallel coordinates based visualization for the number of stores combined with economic and urban development indicators.
  3. Machine learning model for predicting the number of Starbucks stores based on various economic and urban development indicators. This could then be used to predict which countries where Starbucks does not have a presence today would be best suited as new markets. For example, model the number of Starbucks locations in a country based on a) ease of doing business, b) GDP, c) urban to rural population ratio, d) employment rate for people in the 15-24 year age group, e) type of government, f) access to 24-hour electricity and so on.

Data Issues

The data used for this project is obtained via APIs from the Socrata website and the World Bank website and is therefore expected to be relatively error free (for example, as compared to the same data being obtained by scraping these websites). Even so, the data is checked for quality, and appropriate error handling or even alternate mechanisms are put in place to handle errors/issues with the data.

Issues and handling strategies:
Issue: Some of the city names (for example, cities in China) include UTF-16 characters and would therefore not display correctly in documents and charts. Handling: Replace the city name with the country code followed by _1, _2 and so on, for example CN_1, CN_2 etc.
Issue: Missing data in any of the fields in the Starbucks dataset. Handling: Ignore the data for the location if any mandatory field is missing (say the country name is missing). Keep a count of the locations ignored due to missing data to get a sense of the overall quality of the data.
Issue: Incorrect format of the value in various fields in the Starbucks dataset, for example latitude/longitude values being out of range, country codes being invalid etc. Handling: Ignore the data for the location with any invalid value. Keep a count of the locations ignored due to invalid data to get a sense of the overall quality of the data.
Issue: Misc. data type related errors, such as the date time field (first_seen field in the Starbucks dataset) not being correct, or fields considered primary keys not being unique (for example store id for the Starbucks dataset, country code for the WB dataset). Handling: Flag all invalid fields as errors.
Issue: Missing data for any of the indicators in the WB dataset. Handling: The most recent year for which the data is available is 2015; if for a particular indicator the 2015 data is not available then use data for the previous year i.e. 2014. If no data is available for that indicator even for the previous 5 years then flag it as such and have the user define some custom value for it.
Issue: Incorrect format of the value in various fields in the WB dataset, for example alphanumeric or non-numeric data for fields such as GDP for which numeric values are expected. Handling: Provide sufficient information to the user (in this case the programmer) about the incorrect data and have the user define correct values.
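As an illustration of the first few strategies, here is a minimal pandas sketch (not the project's actual code, which lives in the sb and wb submodules) that drops rows with missing mandatory fields or out-of-range coordinates while keeping a count of what was ignored; the mandatory column list is an assumption based on the Starbucks columns shown later in this notebook.

import pandas as pd

def apply_basic_handling(df, mandatory=('store_id', 'country', 'latitude', 'longitude')):
    """Sketch of the missing/invalid data strategies described above."""
    mandatory = [c for c in mandatory if c in df.columns]

    # Ignore locations with any mandatory field missing, but keep a count of them.
    missing_mask = df[mandatory].isnull().any(axis=1)
    print('ignoring %d rows with missing mandatory fields' % missing_mask.sum())
    df = df[~missing_mask]

    # Ignore locations with out-of-range latitude/longitude values, again keeping a count.
    invalid_mask = (df['latitude'].abs() > 90) | (df['longitude'].abs() > 180)
    print('ignoring %d rows with out-of-range coordinates' % invalid_mask.sum())
    return df[~invalid_mask]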

The subsequent sections of this notebook provide a view of the data for the two datasets used in this project and also provide the data quality scores (DQS) for both the datasets.

Datasets (Starbucks and WorldBank WDI)

The Python code for the SB Study (SBS) is run offline and the results (along with the code) are periodically pushed to Github. The code here simply downloads the files from Github to show the results and what the data looks like.


In [9]:
import pandas as pd
#get the Starbucks dataset from Github
#note that the dataset is the CSV generated offline by the SBS program and pushed to the repo's output folder
SB_DATASET_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/SB_data_as_downloaded.csv'
df = pd.read_csv(SB_DATASET_URL)

print('Starbucks dataset has %d rows and %d columns ' %(df.shape[0], df.shape[1]))
df.head()


Starbucks dataset has 24823 rows and 21 columns 
Out[9]:
brand city coordinates country country_subdivision current_timezone_offset first_seen latitude longitude name ... ownership_type phone_number postal_code store_id store_number street_1 street_2 street_3 street_combined timezone
0 Starbucks Hong Kong {u'latitude': u'22.3407001495361', u'needs_rec... CN 91 480 2013-12-08T22:41:59 22.340700 114.201691 Plaza Hollywood ... LS {u'phone_number': u'29554570'} NaN 1 34638-85784 Level 2, Plaza Hollywood, Diamond Hill, Kowloon NaN Level 2, Plaza Hollywood, Diamond Hill,, Kowloon China Standard Time
1 Starbucks Hong Kong {u'latitude': u'22.2839393615723', u'needs_rec... CN 91 480 2013-12-08T22:41:59 22.283939 114.158188 Exchange Square ... LS {u'phone_number': u'21473739'} NaN 6 34601-20281 Shops 308-310, 3/F., Exchange Square Podium, Central, HK. NaN Shops 308-310, 3/F.,, Exchange Square Podium, ... China Standard Time
2 Starbucks Kowloon {u'latitude': u'22.3228702545166', u'needs_rec... CN 91 480 2013-12-08T22:41:59 22.322870 114.213440 Telford Plaza ... LS {u'phone_number': u'27541323'} NaN 8 34610-28207 Shop Unit G1A, Atrium A, Telford Plaza I , Kowloon Bay, Kowloon NaN Shop Unit G1A, Atrium A, Telford Plaza I, , Ko... China Standard Time
3 Starbucks Hong Kong {u'latitude': u'22.2844505310059', u'needs_rec... CN 91 480 2013-12-08T22:41:59 22.284451 114.158463 Hong Kong Station ... LS {u'phone_number': u'25375216'} NaN 13 34622-64463 Concession HOK 3a & b LAR Hong Kong Station NaN Concession HOK 3a & b, LAR Hong Kong Station China Standard Time
4 Starbucks Hong Kong {u'latitude': u'22.2777309417725', u'needs_rec... CN 91 480 2013-12-08T22:41:59 22.277731 114.164917 Pacific Place, Central ... LS {u'phone_number': u'29184762'} NaN 17 34609-22927 Shop 131, Level 1, Pacific Place 88 Queensway, HK NaN Shop 131, Level 1, Pacific Place, 88 Queensway... China Standard Time

5 rows × 21 columns


In [10]:
import pandas as pd
#get the WorldBank WDI dataset from Github
#note that the dataset is the CSV generated offline by the SBS program and pushed to the repo's output folder
WB_DATASET_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/WDI_data_as_downloaded.csv'
df = pd.read_csv(WB_DATASET_URL)

print('Worldbank dataset has %d rows and %d columns ' %(df.shape[0], df.shape[1]))
df.head()


Worldbank dataset has 264 rows and 84 columns 
Out[10]:
country_code IC.TAX.LABR.CP.ZS WP_time_01.1 SP.POP.1564.TO.ZS IC.BUS.NDNS.ZS IC.LGL.CRED.XQ IC.GOV.DURS.ZS DT.DOD.PVLX.GN.ZS SE.ADT.LITR.ZS IC.EXP.COST.CD ... SL.SRV.EMPL.ZS FI.RES.TOTL.DT.ZS IC.FRM.BRIB.ZS IC.TAX.OTHR.CP.ZS IC.REG.COST.PC.ZS IC.ELC.OUTG SL.EMP.WORK.ZS TX.VAL.OTHR.ZS.WT EN.URB.LCTY.UR.ZS SI.POV.2DAY
0 BD NaN NaN 65.579469 NaN 6.0 NaN NaN NaN NaN ... NaN NaN NaN NaN 13.9 NaN NaN 62.451180 31.889816 NaN
1 BE 49.4 NaN 64.830742 NaN 4.0 NaN NaN NaN NaN ... NaN NaN NaN 0.6 4.8 NaN NaN 59.897278 18.516810 NaN
2 BF 21.4 NaN 52.034628 NaN 6.0 NaN NaN NaN NaN ... NaN NaN NaN 3.7 43.5 NaN NaN NaN 50.703959 NaN
3 BG 20.2 NaN 65.828506 NaN 9.0 NaN NaN NaN NaN ... NaN NaN NaN 1.8 0.7 NaN NaN NaN 23.100215 NaN
4 VE 18.0 NaN 65.627593 NaN 1.0 NaN NaN NaN NaN ... NaN NaN NaN 37.1 88.7 NaN NaN 14.755198 10.534170 NaN

5 rows × 84 columns

Data Quality Scores (DQS)

To check data quality, two types of checks were done on both datasets.

  1. Missing data: any cell in the dataset which was empty was counted as missing data. This is checked for all features in both datasets.
  2. Invalid data: validation checks were done on many fields (9 fields in the Starbucks dataset, and all 83 fields in the Worldbank dataset). These checks were both generic (validation for numeric data, absence of special characters, validation of timestamp data, latitude and longitude validation, timezone offset and so on) as well as context specific (store-id has to be unique).

Two metrics are derived to quantify missing data and invalid data: a raw score and an adjusted score.

  1. Raw Score for Missing Data: the percentage of data that is not missing, i.e. the count of non-empty cells vs. the total number of cells, expressed as a percentage. The higher this score, the cleaner the dataset from the perspective of missing data. This score is available for both the Starbucks and WorldBank datasets.

  2. Adjusted Score for Missing Data: the percentage of data that is not missing for mandatory features (i.e. features without which the record would have to be discarded), i.e. the count of non-empty cells in mandatory columns vs. the total number of such cells, expressed as a percentage. If all features are mandatory then the adjusted score and the raw score are the same. The higher this score, the cleaner the dataset from the perspective of missing data. This score is available for both the Starbucks and WorldBank datasets.

  3. Raw Score for Invalid Data: the percentage of data that is not invalid, i.e. the count of cells containing valid data vs. the total number of cells, expressed as a percentage. The higher this score, the cleaner the dataset from the perspective of invalid data. Validity checks are both generic and context specific as defined above. This score is available for both the Starbucks and WorldBank datasets.

  4. Adjusted Score for Invalid Data: the percentage of data that is not invalid for mandatory features (i.e. features without which the record would have to be discarded), i.e. the count of cells containing valid data in mandatory columns vs. the total number of such cells, expressed as a percentage. The higher this score, the cleaner the dataset from the perspective of invalid data. Validity checks are both generic and context specific as defined above. This score is available for both the Starbucks and WorldBank datasets.
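As a concrete illustration, a minimal sketch of how the two missing-data scores could be computed with pandas is shown below; the actual computation is done offline in the *_check_quality.py modules, and the mandatory column list here is only an assumption.

import pandas as pd

def missing_data_scores(df, mandatory_cols):
    """Raw score: % of non-empty cells over all cells.
    Adjusted score: the same ratio restricted to the mandatory columns."""
    raw = 100.0 * df.notnull().values.sum() / df.size
    mandatory = df[mandatory_cols]
    adjusted = 100.0 * mandatory.notnull().values.sum() / mandatory.size
    return raw, adjusted

# Illustrative usage on the Starbucks dataframe loaded earlier (the mandatory list is an assumption):
# raw, adjusted = missing_data_scores(df, ['store_id', 'country', 'latitude', 'longitude'])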

Validity checks implemented for Starbucks dataset

Field Validity Check
Store Id Has to be unique for each row
Latitude/Longitude Longitude measurements range from 0° to (+/–)180°, Latitude from -90° to +90°
Timezone offset Has to be divisible by 15
Country code Has to be present in a country code list downloaded offline
Brand Has to be within one of 6 brands listed on Starbucks website http://www.starbucks.com/careers/brands
Store Number Has to follow a format XXXX-YYYY
Timezone name Has to have either "Standard Time" or "UTC" as part of the name
Ownership Type Should not contain special characters
First seen Has to follow a date format
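A minimal sketch of a few of these checks is shown below, assuming the column names from the dataset preview above; the project's full implementation is in sb/sb_check_quality.py and common/utils.py.

import pandas as pd

def starbucks_validity_checks(df):
    """Sketch of a subset of the validity checks listed above; returns per-check failure counts."""
    failures = {}
    # Store id has to be unique for each row.
    failures['store_id_unique'] = int(df['store_id'].duplicated().sum())
    # Latitude within [-90, 90] and longitude within [-180, 180].
    failures['lat_long_range'] = int(((df['latitude'].abs() > 90) |
                                      (df['longitude'].abs() > 180)).sum())
    # Timezone offset has to be divisible by 15 (minutes).
    failures['tz_offset'] = int((df['current_timezone_offset'] % 15 != 0).sum())
    # Store number has to follow the XXXX-YYYY format (digits, dash, digits).
    failures['store_number_format'] = int((~df['store_number'].astype(str)
                                            .str.match(r'^\d+-\d+$')).sum())
    return failures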

Validity checks implemented for Worldbank dataset

Field Validity Check
Country Code Should not contain special characters
All 83 features (WDIs) Should be numeric as all of them represent a numerical measure of the feature

DQS for Starbucks and WorldBank datasets

The following code fragment downloads a CSV file containing these scores from the Github repo of this project. These scores have been calculated offline by running the program and uploading the results as part of the Github repo. The missing data raw score for the WorldBank data is quite low (~34%); this is because for several parameters the 2015 data is not yet available, so in this case we will use the 2014 data, as these are macro level indicators which do not see a drastic change year on year (this will be discussed more in the next phase of the project).


In [11]:
import pandas as pd
DQS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/dqs.csv'

df = pd.read_csv(DQS_URL)
df.head()


Out[11]:
Datasource Invalid_Data_Raw_Score Invalid_Data_Adjusted_Score Missing_Data_Raw_Score Missing_Data_Adjusted_Score
0 WorldBank 100.000000 100.000000 33.648990 33.64899
1 Starbucks 99.999233 99.999233 91.523031 100.00000

Feature Creation

Three (for now) new features are created using the datasets. These features are all added to the Starbucks dataset. These are described below:

Feature Format Explanation
Continent String Map the country code to the continent to which the country belongs. This feature enables answering basic questions like what % of stores exist in Asia as compared to North America for example. The country code to continent mapping is obtained with the help of two csv files downloaded offline (code to country mapping, country to continent mapping), first country code is mapped to country name and then country name is mapped to the continent name.
On Airport Boolean Does this store exist on an airport? This feature helps answer questions like how many of the Starbucks stores in Europe exist on airports, for example. For now a store is considered to be on an airport if the name field or the street combined field for the store contains one of the following terms: "Airport, Arpt, Terminal". Going forward this logic will be replaced with an IATA API (see comments in code, wb.py file).
Ease of doing business Category String The ease of doing business index (obtained from the WorldBank data) for the country in which the store exists is mapped to a category viz. Very High (1 to 10), High (11 to 30), Medium (31 to 90), Low (91 to 130), Very Low (131 and beyond) and finally Unknown for countries where no data is available. This feature provides insights like: around 70% of Starbucks stores exist in countries which have an ease of doing business index between 1 and 10 (with 1 being the highest and 189 being the lowest). The ease of doing business index is obtained by looking up the country code in the WDI dataset.
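The actual feature creation code is in sb/sb.py; the snippet below is only a rough sketch of the On Airport and Continent features, assuming the column names shown in the dataset preview and assuming the two offline lookup files have already been loaded into dictionaries.

import pandas as pd

AIRPORT_TERMS = ('airport', 'arpt', 'terminal')

def add_features(sb_df, code_to_name, name_to_continent):
    """Sketch of two of the new features described in the table above."""
    # On Airport: True if the store name or combined street address contains an
    # airport-related term (case-insensitive match).
    text = (sb_df['name'].fillna('') + ' ' + sb_df['street_combined'].fillna('')).str.lower()
    sb_df['on_airport'] = text.apply(lambda s: any(term in s for term in AIRPORT_TERMS))

    # Continent: country code -> country name -> continent, using dictionaries assumed to be
    # built from the offline lookup files (data.csv and countries.csv).
    sb_df['continent'] = sb_df['country'].map(code_to_name).map(name_to_continent)
    return sb_df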

The following table shows a view of the modified dataset with new features added.


In [2]:
import pandas as pd
SB_W_NEW_FEATURES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/SB_data_w_features.csv'

df = pd.read_csv(SB_W_NEW_FEATURES_URL)

#show only some of the columns so that new feature can be shown without a scrollbar
cols = ['store_id', 'country', 'name', 'city', 'brand', 'continent', 'on_airport', 'eodb_category']
df[cols].head()


Out[2]:
store_id country name city brand continent on_airport eodb_category
0 1 CN Plaza Hollywood Hong Kong Starbucks Asia False M
1 6 CN Exchange Square Hong Kong Starbucks Asia False M
2 8 CN Telford Plaza Kowloon Starbucks Asia False M
3 13 CN Hong Kong Station Hong Kong Starbucks Asia False M
4 17 CN Pacific Place, Central Hong Kong Starbucks Asia False M

Some notes about the code and how to run the SBS program

This section provides information about how to run the code, how the code is organized among different modules, and what output files are generated.

How to run the code

When you unzip the zip file from Github, you will see a folder called sb_study which contains all the code, helper files and even an output folder (the contents of the output folder get rewritten on every run of the SBS program). As a pre-requisite, the machine on which this program is run needs to have Python 3.5.2|Anaconda 4.1.1 installed. Run the program by simply saying

python main.py

The code takes about 2 to 3 minutes to run completely. As the program runs it prints traces on the console, and these traces are also automatically logged to a file called SBS.log in the output directory (created by the SBS program in the current working directory if not already present). The program creates the output directory and several sub-directories inside it to store the various artifacts (csv files, plots etc.) produced while running the program.

The program can also be run in analysis-only mode, which assumes that the data has already been downloaded by running the program in the full mode first. In analysis-only mode the program runs various forms of analysis on the data and stores the created artifacts under subdirectories in the output folder. This mode is useful if only the analysis needs to be redone and the entire dataset does not need to be downloaded again. To run the program in analysis-only mode use the following command line.

python main.py -a

Code organization and other files

Here is a listing of everything that is included along with a brief description.

File name Notes
main.py Everything starts from here, as the name suggests this is the main file of the package. This is what we run. It is organized to run a series of functions like a data science pipeline, for example get_data, clean_data and so on. It internally invokes the wb (for world bank) module and the sb (for Starbucks) module to do the same tasks for the Worldbank and Starbucks datasets respectively.
README.md This is the readme markdown file for Github.
LICENCE.md This is the license markdown file for Github. The SBS project is available under MIT license.
countries.csv A CSV file which provides the country to continent mapping.
data.csv A CSV file which provides the country code to country name mapping. Downloaded offline.
WDI_Series.csv A CSV file containing the name and description of World Development Indicators (WDI) selected for use in this project. Created offline.
wb Directory for the world bank submodule
wb\wb.py Main file for the WorldBank submodule. It contains functions to run the data science pipeline for the WorldBank dataset.
wb\wb_check_quality.py This file performs the data quality checks i.e. missing data and invalid data checks for the WorldBank data. It internally invokes the utils.py module for some of the data validation checks.
sb\sb.py Main file for the Starbucks submodule. It contains functions to run the data science pipeline for the Starbucks dataset. It also contains code for feature generation.
sb\sb_check_quality.py This file performs the data quality checks i.e. missing data and invalid data checks for the Starbucks data. It internally invokes the utils.py module for some of the data validation checks. Many of the data validation checks done are specific to the Starbucks data set and are included in this file.
common Directory for the common submodule.
common\utils.py This file contains utility functions that are common to both modules, so this is the one place to implement these functions. Several data validation checks are also included in this file.
common\logger.py This file creates a logger object using the Python logging module. It sets the format specifier so that the trace messages have the desired format (timestamp, module name, etc.) and also takes care of attaching two handlers, one for the console and one for the log file so all traces go to both the console and the log file.
common\globals.py This file contains the common constants and even some variables that are common across modules. Think of this as a common header file.
output This is the folder which contains all the output (including log file) produced by the SBS program. The SBS program creates this directory if not present.
output\SBS.log This is the log file created upon running the program. This file also contains more detailed DQS metrics i.e. missing data and invalid data related errors for individual features.
output\SB_data_as_downloaded.csv This is the CSV version of the Starbucks data downloaded via the Socrata API.
output\SB_data_w_features.csv This is the CSV version of the Starbucks data with the new features added to it (last three columns) as part of the feature creation activity.
output\WB_data_as_downloaded.csv This is the CSV version of the WorldBank data downloaded via the WB API.
output\dqs.csv This is the CSV file which contains the data quality score for both the datasets.

A note about the structure of the output folder.


The output folder contains several csv files which correspond to data as downloaded, cleaned data and combined data.

  • WDI_SB.csv: this is the combined dataset; all analysis is run on this dataset.
  • DQS: this folder contains the various DQS files generated at different stages; the contents are self-explanatory. The Jupyter notebook displays the contents of these files.
  • EDA: this folder contains artifacts from the exploratory data analysis, which include csv files and plots; the contents are self-explanatory. The Jupyter notebook displays the contents of these files.
  • scatter: this folder contains various scatter plots generated for the combined dataset. Specifically, the plots between the number of Starbucks stores and the WDI indicators.
  • regression: this folder contains various artifacts (csv files and plots) generated as part of regression.
  • clustering: this folder contains various artifacts (csv files and plots) generated as part of clustering.
  • classification: this folder is a placeholder for artifacts created as part of classification. Currently the classification results can be seen in the Jupyter notebook and the SBS.log file.
  • association_rules: this folder contains various artifacts (csv files only) generated as part of association rule mining. These csv files are included as part of the Jupyter notebook.

Combining the two datasets

The WorldBank WDI indicators dataset and the Starbucks dataset are combined for the hypothesis testing and machine learning described later in this notebook. To combine the two datasets we first filter out all the countries from the WorldBank dataset which do not have a Starbucks store, i.e. we keep only those countries in the combined dataset that have at least one Starbucks store. This combined dataset is then cleaned (described later in this document) and all analysis is run on it. There are 72 countries in the world where Starbucks is present, so the combined dataset contains 72 rows and 48 columns (more on the column count later). The following fields computed from the Starbucks dataset are added to the combined dataset:

Feature Description
Num.Starbucks.Stores Total number of Starbucks stores in the country
Num.Starbucks.Stores.Categorical Total number of Starbucks stores in the country categorized as VL (Very Low), L (Low), M (Medium), H (High), VH (Very High)
Starbucks.Store.Density Number of Starbucks stores per 100,000 people
Starbucks.Store.Density.Categorical Number of Starbucks stores per 100,000 people categorized as VL (Very Low), L (Low), M (Medium), H (High), VH (Very High)
Exists.inMultiple.Cities True if Starbucks stores exist in multiple cities in a country, False otherwise
Ownership.Type.Mixed True if multiple ownership types exist for Starbucks stores in a country (franchisee owned, Starbucks owned etc.), False otherwise
Continent Name of the continent in which the country exists
MultipleBrands True if Starbucks exists as multiple brands in the country, False otherwise

The shape of the combined dataset is (72,48). This includes 35 WDI (World Development Indicators) from the WorldBank dataset.
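A minimal sketch of how this filtering and joining could be expressed in pandas is shown below; the column names follow the dataset previews above, but the project's actual combining logic may differ in details.

import pandas as pd

def build_combined_dataset(sb_df, wb_df):
    """Sketch: keep only WorldBank rows for countries that have at least one Starbucks
    store, and attach the store counts computed from the Starbucks data."""
    # Number of Starbucks stores per country (ISO-2 code in the 'country' column).
    store_counts = sb_df.groupby('country').size().rename('Num.Starbucks.Stores')

    # Keep only countries present in the Starbucks data, then join the store counts.
    combined = wb_df[wb_df['country_code'].isin(store_counts.index)].copy()
    return combined.merge(store_counts, left_on='country_code', right_index=True)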

For converting continuous columns to categorical, a quantile (percentile) based binning strategy is used, as shown in the sketch after the following table.

Range Category
<= 10th percentile Very Low (VL)
<=20th percentile Low (L)
<=60th percentile Medium (M)
<= 80th percentile High (H)
beyond 80th percentile Very High (VH)
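A minimal sketch of this percentile-based binning, using pandas.qcut with the quantile edges from the table above, is shown below (real data with many tied values may need the bin edges adjusted).

import pandas as pd

def to_percentile_category(series):
    """Sketch of the percentile-based binning described in the table above."""
    # Bin edges at the 10th, 20th, 60th and 80th percentiles of the feature.
    return pd.qcut(series, q=[0, 0.10, 0.20, 0.60, 0.80, 1.0],
                   labels=['VL', 'L', 'M', 'H', 'VH'])

# Illustrative usage on the combined dataset:
# combined['Num.Starbucks.Stores.Categorical'] = to_percentile_category(combined['Num.Starbucks.Stores'])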

Exploratory Data Analysis

This section describes various types of exploratory analysis run on the two datasets as well as on the combined dataset before getting into the hypothesis building and machine learning part.

Basic Statistical Analysis and data cleaning

This section provides basic statistical analysis for both the WorldBank and Starbucks datasets. The following boxplot provides a quick overview of the distribution of Starbucks stores across continents. Some quick observations from the box plot are described below:

  1. There is a huge variation in the distribution of Starbucks stores in the continent of North America. This is expected since, other than the U.S. and Canada, all other countries are islands. Also, the North American continent contains the country with the most Starbucks stores (the U.S. of course, no surprises there).

  2. There is minimal variation in the distribution of Starbucks stores in the continent of Oceania since it consists of only two countries which have Starbucks stores, i.e. Australia and New Zealand.

  3. The European continent has the most even distribution around the median.

Mean, mode, median, standard deviation

The WorldBank dataset consists of all numeric values except for the country code, country name and the derived feature for ease of doing business (categorical). The country code and name are excluded from the statistical analysis since there is only one row per country in the dataset. The statistical measures for each feature are listed below. Some of the features show NaN for the statistical measures; that is because there is no data at all for these features (they are subsequently removed from the dataset, as described later).


In [19]:
import pandas as pd
WB_EDA_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/WB_EDA.csv'
df = pd.read_csv(WB_EDA_URL)
df


Out[19]:
feature mean median mode stddev
0 IC.TAX.METG 1.467218e+00 1.435000e+00 0 5.794943e-01
1 SL.EMP.TOTL.SP.FE.NE.ZS 4.530958e+01 4.750000e+01 0 1.220095e+01
2 IT.NET.USER.P2 4.755657e+01 4.688564e+01 0 2.769172e+01
3 SP.POP.TOTL 2.970610e+08 1.009667e+07 0 9.376648e+08
4 VC.PKP.TOTL.UN 4.955647e+03 7.930000e+02 0 6.086792e+03
5 IC.GOV.DURS.ZS 8.830681e+00 8.956250e+00 0 4.266066e+00
6 IC.REG.PROC 6.938789e+00 7.000000e+00 0 2.928893e+00
7 IC.LGL.CRED.XQ 5.110327e+00 5.000000e+00 0 2.741384e+00
8 EG.ELC.ACCS.UR.ZS NaN NaN 0 NaN
9 IC.BUS.NDNS.ZS 3.674968e+00 2.290769e+00 0 4.270679e+00
10 EN.URB.LCTY.UR.ZS 2.988715e+01 2.684948e+01 0 1.682697e+01
11 SL.EMP.TOTL.SP.MA.NE.ZS 6.361581e+01 6.384569e+01 0 8.498745e+00
12 IC.ELC.DURS 3.351995e+01 3.074444e+01 0 2.697875e+01
13 SP.URB.TOTL.IN.ZS 5.862963e+01 5.837484e+01 0 2.313314e+01
14 NY.GNP.PCAP.CD 1.267481e+04 5.585000e+03 0 1.769632e+04
15 SL.TLF.ACTI.1524.NE.ZS 4.255292e+01 4.113231e+01 0 1.256837e+01
16 IC.WRH.DURS 1.611548e+02 1.540000e+02 0 8.145088e+01
17 IC.TAX.TOTL.CP.ZS 4.088493e+01 3.927222e+01 0 1.914462e+01
18 SI.POV.GINI 4.012033e+01 4.172000e+01 0 9.002209e+00
19 SI.POV.2DAY 1.790452e+01 1.129000e+01 0 2.032023e+01
20 IC.ELC.TIME 9.773877e+01 8.700000e+01 0 6.537607e+01
21 IC.TAX.OTHR.CP.ZS 8.843270e+00 3.000000e+00 0 1.803673e+01
22 IC.ISV.DURS 2.597505e+00 2.581818e+00 0 1.055514e+00
23 FB.ATM.TOTL.P5 4.842994e+01 4.170072e+01 0 4.303515e+01
24 SL.UEM.NEET.ZS 1.347953e+01 1.228000e+01 0 6.136543e+00
25 EG.ELC.RNEW.ZS NaN NaN 0 NaN
26 IQ.WEF.PORT.XQ 3.958079e+00 3.904944e+00 0 1.123580e+00
27 IC.TAX.GIFT.ZS 1.529794e+01 1.701000e+01 0 8.678293e+00
28 IQ.WEF.CUST.XQ 3.983466e+00 3.813284e+00 0 8.156708e-01
29 EN.ATM.PM25.MC.ZS NaN NaN 0 NaN
... ... ... ... ... ...
53 BX.KLT.DINV.WD.GD.ZS 4.660374e+00 2.556654e+00 0 8.348388e+00
54 TX.VAL.TECH.CD 7.403359e+10 2.120983e+08 0 2.555456e+11
55 EN.CLC.MDAT.ZS NaN NaN 0 NaN
56 SL.EMP.WORK.ZS 8.124434e+01 8.417780e+01 0 9.151259e+00
57 SH.STA.ACSN.UR 7.898257e+01 8.797773e+01 0 2.325356e+01
58 WP_time_01.1 5.330584e+01 5.126150e+01 0 3.125552e+01
59 SL.IND.EMPL.ZS 2.338136e+01 2.320000e+01 0 6.635862e+00
60 EN.POP.SLUM.UR.ZS 4.404918e+01 4.300000e+01 0 2.234641e+01
61 SP.POP.1564.TO.ZS 6.374940e+01 6.536425e+01 0 6.471699e+00
62 SE.ADT.LITR.ZS 8.416093e+01 9.397165e+01 0 1.876130e+01
63 FI.RES.TOTL.DT.ZS 9.210792e+01 4.777428e+01 0 2.829776e+02
64 IC.BUS.DISC.XQ 5.470902e+00 5.487101e+00 0 2.288636e+00
65 CM.MKT.TRAD.GD.ZS 6.147025e+01 1.907471e+01 0 1.006608e+02
66 SL.SRV.EMPL.ZS 6.585253e+01 6.905000e+01 0 1.252427e+01
67 NE.CON.PRVT.KD.ZG 2.794388e+00 3.038269e+00 0 4.618765e+00
68 AG.LND.TOTL.UR.K2 NaN NaN 0 NaN
69 FP.CPI.TOTL.ZG 3.764240e+00 1.835994e+00 0 9.904829e+00
70 ST.INT.RCPT.XP.ZS 1.426136e+01 6.996199e+00 0 1.748747e+01
71 IC.REG.COST.PC.ZS 2.671969e+01 1.397500e+01 0 3.990623e+01
72 LP.LPI.OVRL.XQ 2.867913e+00 2.739469e+00 0 5.229044e-01
73 IC.FRM.BRIB.ZS 2.187780e+01 2.390000e+01 0 1.124342e+01
74 IC.FRM.DURS 2.277239e+01 1.909048e+01 0 1.139392e+01
75 IT.CEL.SETS.P2 1.063206e+02 1.078952e+02 0 3.922009e+01
76 IC.IMP.COST.CD 1.883703e+03 1.480000e+03 0 1.511363e+03
77 IC.EXP.COST.CD 1.564382e+03 1.270810e+03 0 1.127318e+03
78 IC.TAX.LABR.CP.ZS 1.572635e+01 1.405870e+01 0 9.984998e+00
79 IC.ELC.OUTG 7.778588e+00 5.650000e+00 0 8.081534e+00
80 EN.ATM.CO2E.EG.ZS NaN NaN 0 NaN
81 DT.DOD.PVLX.GN.ZS 2.090316e+01 1.743921e+01 0 1.352618e+01
82 Ease.Of.Doing.Business 0.000000e+00 0.000000e+00 VeryLow:70 0.000000e+00

83 rows × 5 columns

The Starbucks dataset consists of all categorical features, therefore only a mode calculation is valid. Several features such as street names are excluded from the mode calculation as well. There are some interesting observations to be made:

  1. The city with the maximum number of Starbucks stores is a city outside of the U.S.
  2. The maximum number of Starbucks stores is in the U.S. and in the Eastern Time zone.
  3. Most Starbucks stores (about 70%) are in countries with a very high ease of doing business.

In [20]:
import pandas as pd
SB_EDA_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/SB_EDA.csv'
df = pd.read_csv(SB_EDA_URL)
df


Out[20]:
feature mean median mode stddev
0 brand 0 0 Starbucks:24733 0
1 city 0 0 上海市:498 0
2 country 0 0 US:13486 0
3 ownership_type 0 0 CO:11777 0
4 timezone 0 0 Eastern Standard Time:5460 0
5 continent 0 0 North America:15568 0
6 on_airport 0 0 False:24641 0
7 eodb_category 0 0 VH:15549 0

Outlier Detection

Outlier detection is performed on the WorldBank dataset. We determine outliers using a simple rule: anything beyond three standard deviations is an outlier. The results of outlier detection are provided below. Note that while there are values which get categorized as outliers, they are still valid values (as understood by human inspection of the data), therefore no outlier handling as such (smoothing, removal etc.) is done for these values. Also, given that these are valid values, they cannot be ignored as they may have a significant role in the machine learning algorithms.

No outlier detection is performed on the Starbucks dataset as it contains only categorical values (values outside the well-defined set of valid values are considered invalid, and that analysis has already been performed as part of the data quality detection phase).
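A minimal sketch of the three-standard-deviation rule, producing a table similar to the one below, is shown here; the project's exact implementation (for example whether the deviation is measured from the column mean) may differ slightly.

import pandas as pd

def find_outliers(df, name_col='name'):
    """Sketch: flag numeric values more than three standard deviations away from the column mean."""
    records = []
    for col in df.select_dtypes(include='number').columns:
        threshold = 3 * df[col].std()
        outliers = df[(df[col] - df[col].mean()).abs() > threshold]
        for _, row in outliers.iterrows():
            records.append({'entry': row[name_col], 'field': col,
                            'value': row[col], '3rdStdDev': threshold})
    return pd.DataFrame(records, columns=['entry', 'field', 'value', '3rdStdDev'])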


In [21]:
import pandas as pd
WB_OUTLIER_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/WB_outliers.csv'
df = pd.read_csv(WB_OUTLIER_URL)
df


Out[21]:
dataset entry field value 3rdStdDev
0 WB Early-demographic dividend SP.POP.TOTL 3.122703e+09 2.812994e+09
1 WB Middle income SP.POP.TOTL 5.521157e+09 2.812994e+09
2 WB IBRD only SP.POP.TOTL 4.542581e+09 2.812994e+09
3 WB Low & middle income SP.POP.TOTL 6.159443e+09 2.812994e+09
4 WB IDA & IBRD total SP.POP.TOTL 6.183635e+09 2.812994e+09
5 WB World SP.POP.TOTL 7.346633e+09 2.812994e+09
6 WB Bhutan IC.GOV.DURS.ZS 2.880000e+01 1.279820e+01
7 WB Equatorial Guinea IC.REG.PROC 1.800000e+01 8.786678e+00
8 WB Venezuela, RB IC.REG.PROC 1.700000e+01 8.786678e+00
9 WB Philippines IC.REG.PROC 1.600000e+01 8.786678e+00
10 WB New Zealand IC.BUS.NDNS.ZS 1.663000e+01 1.281204e+01
11 WB Malta IC.BUS.NDNS.ZS 1.726000e+01 1.281204e+01
12 WB Hong Kong SAR, China IC.BUS.NDNS.ZS 3.130000e+01 1.281204e+01
13 WB Hong Kong SAR, China EN.URB.LCTY.UR.ZS 1.000000e+02 5.048091e+01
14 WB Macao SAR, China EN.URB.LCTY.UR.ZS 9.945780e+01 5.048091e+01
15 WB Singapore EN.URB.LCTY.UR.ZS 1.000000e+02 5.048091e+01
16 WB Ethiopia IC.ELC.DURS 1.943000e+02 8.093626e+01
17 WB Luxembourg NY.GNP.PCAP.CD 7.700000e+04 5.308897e+04
18 WB Switzerland NY.GNP.PCAP.CD 8.418000e+04 5.308897e+04
19 WB Qatar NY.GNP.PCAP.CD 8.543000e+04 5.308897e+04
20 WB Norway NY.GNP.PCAP.CD 9.382000e+04 5.308897e+04
21 WB Macao SAR, China NY.GNP.PCAP.CD 7.630000e+04 5.308897e+04
22 WB Barbados IC.WRH.DURS 4.420000e+02 2.443526e+02
23 WB Cyprus IC.WRH.DURS 6.170000e+02 2.443526e+02
24 WB Cambodia IC.WRH.DURS 6.520000e+02 2.443526e+02
25 WB Brazil IC.WRH.DURS 4.257000e+02 2.443526e+02
26 WB Zimbabwe IC.WRH.DURS 4.480000e+02 2.443526e+02
27 WB Argentina IC.TAX.TOTL.CP.ZS 1.374000e+02 5.743387e+01
28 WB Comoros IC.TAX.TOTL.CP.ZS 2.165000e+02 5.743387e+01
29 WB Bangladesh IC.ELC.TIME 4.289000e+02 1.961282e+02
... ... ... ... ... ...
118 WB Macao SAR, China ST.INT.RCPT.XP.ZS 9.367504e+01 5.246241e+01
119 WB South Sudan IC.REG.COST.PC.ZS 3.301000e+02 1.197187e+02
120 WB Central African Republic IC.REG.COST.PC.ZS 2.040000e+02 1.197187e+02
121 WB Haiti IC.REG.COST.PC.ZS 2.353000e+02 1.197187e+02
122 WB Chad IC.REG.COST.PC.ZS 1.504000e+02 1.197187e+02
123 WB Djibouti IC.REG.COST.PC.ZS 1.681000e+02 1.197187e+02
124 WB Mauritania IC.FRM.DURS 6.460000e+01 3.418175e+01
125 WB Kuwait IT.CEL.SETS.P2 2.317632e+02 1.176603e+02
126 WB Hong Kong SAR, China IT.CEL.SETS.P2 2.288316e+02 1.176603e+02
127 WB Macao SAR, China IT.CEL.SETS.P2 3.244408e+02 1.176603e+02
128 WB Zambia IC.IMP.COST.CD 7.060000e+03 4.534090e+03
129 WB South Sudan IC.IMP.COST.CD 9.285000e+03 4.534090e+03
130 WB Congo, Rep. IC.IMP.COST.CD 7.590000e+03 4.534090e+03
131 WB Tajikistan IC.IMP.COST.CD 1.065000e+04 4.534090e+03
132 WB Uzbekistan IC.IMP.COST.CD 6.452000e+03 4.534090e+03
133 WB Chad IC.IMP.COST.CD 9.025000e+03 4.534090e+03
134 WB Zambia IC.EXP.COST.CD 5.165000e+03 3.381954e+03
135 WB South Sudan IC.EXP.COST.CD 5.335000e+03 3.381954e+03
136 WB Tajikistan IC.EXP.COST.CD 9.050000e+03 3.381954e+03
137 WB Afghanistan IC.EXP.COST.CD 5.045000e+03 3.381954e+03
138 WB Central African Republic IC.EXP.COST.CD 5.490000e+03 3.381954e+03
139 WB Uzbekistan IC.EXP.COST.CD 5.090000e+03 3.381954e+03
140 WB Kazakhstan IC.EXP.COST.CD 5.285000e+03 3.381954e+03
141 WB Chad IC.EXP.COST.CD 6.615000e+03 3.381954e+03
142 WB Belgium IC.TAX.LABR.CP.ZS 4.940000e+01 2.995499e+01
143 WB France IC.TAX.LABR.CP.ZS 5.350000e+01 2.995499e+01
144 WB Nigeria IC.ELC.OUTG 3.280000e+01 2.424460e+01
145 WB Papua New Guinea IC.ELC.OUTG 4.190000e+01 2.424460e+01
146 WB Bhutan DT.DOD.PVLX.GN.ZS 6.815542e+01 4.057852e+01
147 WB Cabo Verde DT.DOD.PVLX.GN.ZS 6.770737e+01 4.057852e+01

148 rows × 5 columns

Bin the data

As part of the analysis that follows, several continuous/discrete features were binned (as described below, this is needed for association rule mining, feature selection etc.). Here we show the binning of one specific feature, IC.BUS.EASE.XQ, which quantifies the ease of doing business in a country (1 being the best; the larger the value, the lower the ease of doing business), into a categorical value. Numeric values get mapped to categories as follows:

Range Category
[1,11) VeryHigh
[11,31) High
[31,91) Medium
[91, 131) Low
[131, beyond) VeryLow

The benefit of binning this feature is that it supports exploratory data analysis to answer questions like what % of Starbucks stores are in countries with high or very high ease of doing business. The method used provides a simple strategy to bin the data, although it would require a subject matter expert to say what the most appropriate bin edges are.
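A minimal sketch of this fixed-edge binning using pandas.cut is shown below; the edges and labels come from the table above.

import numpy as np
import pandas as pd

def bin_ease_of_doing_business(df):
    """Sketch: map the IC.BUS.EASE.XQ rank to the categories in the table above."""
    df['Ease.Of.Doing.Business'] = pd.cut(
        df['IC.BUS.EASE.XQ'],
        bins=[1, 11, 31, 91, 131, np.inf],  # [1,11), [11,31), [31,91), [91,131), [131, beyond)
        right=False,
        labels=['VeryHigh', 'High', 'Medium', 'Low', 'VeryLow'])
    return df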

Here is an excerpt from the WorldBank data showing the new column created as a result of binning along with the original feature (first 5 rows shown).


In [22]:
import pandas as pd
WB_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/WB_data_w_features.csv'
df = pd.read_csv(WB_URL)
df[['name', 'IC.BUS.EASE.XQ', 'Ease.Of.Doing.Business']].head()


Out[22]:
name IC.BUS.EASE.XQ Ease.Of.Doing.Business
0 Belgium 43.000000 Medium
1 Equatorial Guinea 180.000000 VeryLow
2 Early-demographic dividend 117.868852 Low
3 Small states 107.354167 Low
4 Bangladesh 174.000000 VeryLow

Handling missing values

There are not a lot of missing values in the Starbucks dataset (the adjusted data quality score for missing values is 100%, which means all the features that we care about have no missing values; see the section on DQS above). The WorldBank dataset, however, has a lot of missing values; its DQS for missing values is ~33.65%. The missing values in the WorldBank dataset are handled as follows:

  1. The WorldBank data is updated every year and many times there is a time lag in the data becoming available for a year. Also, since these indicators are macro level indicators, the year on year change is not drastic. As a first step, all the values that are missing for the latest year are filled with the values from the previous year whenever the previous year's data is available.

  2. Define a feature density threshold of 90% and delete all features which are less than 90% filled.

  3. Finally, the analysis is performed on the combined dataset; as described above, the combined dataset is obtained by joining the WorldBank and Starbucks datasets only for those countries that exist in the Starbucks dataset. This intersection results in a dataset which is fairly complete (99.8%).

    • Any missing values are handled by replacing them with the mean value of the feature.
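A minimal sketch of these three steps is shown below, assuming the yearly WDI data is held in dataframes indexed by country code; the project's actual cleaning code may differ in details.

import pandas as pd

def handle_missing_values(wdi_2015, wdi_2014, density_threshold=0.9):
    """Sketch of the three-step missing value strategy described above."""
    # Step 1: fill 2015 gaps with the previous year's value wherever it is available.
    filled = wdi_2015.fillna(wdi_2014)

    # Step 2: drop features (columns) that are less than 90% filled.
    density = filled.notnull().mean()
    filled = filled.loc[:, density >= density_threshold]

    # Step 3: after combining with the Starbucks data, replace any remaining
    # missing values with the mean of the feature.
    return filled.fillna(filled.mean(numeric_only=True))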

The DQS metrics for the original datasets as well as the combined dataset after cleaning are shown below.


In [24]:
import pandas as pd
DQS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/dqs/dqs_after_cleaning.csv'
df = pd.read_csv(DQS_URL)
#This is the DQS after Step 1 described above
df


Out[24]:
Datasource Combined_Score Invalid_Data_Raw_Score Invalid_Data_Adjusted_Score Missing_Data_Raw_Score Missing_Data_Adjusted_Score
0 WorldBank 76.92776 100.00000 100.00000 53.855519 53.855519
1 Starbucks 99.99981 99.99962 99.99962 91.529921 100.000000

In [25]:
import pandas as pd
DQS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/dqs/dqs_combined_dataset.csv'
df = pd.read_csv(DQS_URL)
#This is the DQS after Step 3 described above (Combined dataset, actually used for analysis)
df


Out[25]:
dataset before_cleaing after_cleaning
0 WDI+SB 61.20915 99.812312

Data cleaning for the combined dataset

The combined dataset is cleaned for missing values as described in the previous section.

  1. Missing values are handled as described above (take missing data from 2014, remove features for which the density is less than 90%; many rows get removed when we combine the WorldBank data with the Starbucks data).
  2. There are no invalid values in the WorldBank dataset (see Data Issues and Data Quality Score sections above). The invalid values from the Starbucks dataset (some lat/long values were missing, names of some streets could not be read due to errors with unicode handling) do not impact the overall analysis and are therefore ignored, see adjusted DQS for invalid data.

Histograms and Correlations

This section provides histograms and correlations from the individual datasets where appropriate and also from the combined dataset used for analysis.

Histograms from WorldBank dataset

The following figure shows histograms of three different features from the WorldBank dataset.

The observations from these histograms are described below.

Feature Description Observations
IC.BUS.EASE.XQ Ease of doing business index (1=most business-friendly regulations) Almost uniformly distributed except for the 100 to 150 range (indicating a higher concentration of countries in that range).
SP.URB.GROW Urban population growth (annual %) Almost normally distributed.
WP15163_4.1 Mobile account (% age 15+) Almost looks like an EL (exponential-logarithmic) distribution.

Histograms from the combined dataset

The following figure shows histograms of four different features from the combined dataset. The value of the correlation coefficient is also shown.

The observations from these histograms are described below.

Feature Description Observations
IT.NET.USER.P2 Internet users (per 100 people) Looks like a left-skewed distribution. Most countries with Starbucks stores have a very high percentage of Internet users, which makes sense if one visualizes a typical coffee shop.
Num.Starbucks.Stores Number of Starbucks stores Most countries (in fact 69 out of the 73 which have Starbucks) have less than 1,000 stores, which makes sense but does not tell us much. A better representation would be to increase the number of bins in this histogram.
SP.POP.TOTL Total population The outliers in this chart towards the extreme right probably represent India and China.
ST.INT.ARVL International tourism, number of arrivals Interestingly, it appears that most Starbucks stores are not located in countries with very high tourist footfalls.

Scatter plots and correlation coefficients for the WorldBank dataset

The following figure shows the scatter plot for three features in the WorldBank dataset.

From the scatter plots we can observe that there is a negative correlation between the ease of doing business index and the number of Internet users in a country; but since the ease of doing business index is expressed such that a lower number means greater ease, it follows that ease of doing business and number of Internet users are effectively positively correlated.

Ease of doing business is also positively correlated with per capita GDP. There also seems to be a positive correlation between the number of Internet users and per capita GDP.

It is important to mention here the oft-repeated phrase: correlation does not imply causation.

The following table shows the correlation coefficients between these features, and the value of r corroborates the observations from the scatter plot above.


In [26]:
import pandas as pd
WB_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/WDI_pearson_r.csv'
df = pd.read_csv(WB_URL)
df


Out[26]:
Dataset feature1 feature2 r
0 WB IC.BUS.EASE.XQ SL.GDP.PCAP.EM.KD -0.557920
1 WB IC.BUS.EASE.XQ IT.NET.USER.P2 -0.831488
2 WB SL.GDP.PCAP.EM.KD IT.NET.USER.P2 0.741431

Scatter plots and correlation coefficients for the combined dataset

The following figure shows the scatter plots for features in the combined dataset.

From the scatter plots we can observe Number of Starbucks stores has somewhat of a positive non-linear relationship with the number of international tourist arrivals, total population and number of people having access to the Internet. This is further clarified when we do polynomial regression between number of Starbucks stores and number of international tourist arrivals.

The following table shows the correlation coefficients between these features, and the value of r corroborates the observations from the scatter plot above.


In [27]:
import pandas as pd
WB_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/Combined_r.csv'
df = pd.read_csv(WB_URL)
df


Out[27]:
Dataset feature1 feature2 r
0 combined Num.Starbucks.Stores IT.NET.USER.P2 0.028446
1 combined Num.Starbucks.Stores ST.INT.ARVL 0.521229
2 combined Num.Starbucks.Stores SP.POP.TOTL 0.263139
3 combined IT.NET.USER.P2 ST.INT.ARVL 0.087930
4 combined IT.NET.USER.P2 SP.POP.TOTL -0.324875
5 combined ST.INT.ARVL SP.POP.TOTL 0.311971

Cluster Analysis

Cluster analysis was conducted to find patterns within the data (from the combined dataset). Recall that the combined dataset is created by filtering out countries which do not have a Starbucks presence, which brings a Starbucks specific dimension to this dataset. The presence of Starbucks stores (say in terms of the number of stores in a country, or any other feature derived from the Starbucks dataset) is the dependent variable in this analysis, therefore any patterns we find in the clusters that are created are examined in that light, i.e. do the cluster members show an affinity to a particular value of the dependent variable.

Before clustering, a Principal Component Analysis (PCA) is first conducted to transform the data into 3 dimensions. PCA in 3 dimensions helps because the clusters can then be visualized in 3D space. Three different types of clustering mechanisms are used viz. KMeans, DBSCAN and Hierarchical (Agglomerative) clustering. This is described in detail in the next few sections; let's examine the PCA results first.

The following figure shows the result of PCA in 3 dimensions. Note that PCA only helps to spatially distinguish the data, but seeing the clusters without a spectral (color) differentiation is difficult (which is what the clustering algorithms provide).

KMeans, DBSCAN and Hierarchical Clustering

The following figures show the results of the 3 clustering techniques. It appears that KMeans (k=5) provides the best clustering.
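A minimal sketch of this PCA-then-cluster pipeline with scikit-learn is shown below; the DBSCAN parameters and the standardization step are assumptions, not necessarily the project's exact settings.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

def cluster_combined(df_numeric):
    """Sketch: project the numeric features to 3 principal components, then cluster three ways."""
    X = StandardScaler().fit_transform(df_numeric.fillna(df_numeric.mean()))
    X3 = PCA(n_components=3).fit_transform(X)  # 3 components so the clusters can be plotted in 3D

    labels = pd.DataFrame({
        'KMeans_labels': KMeans(n_clusters=5, random_state=0).fit_predict(X3),
        'DBSCAN_labels': DBSCAN(eps=0.5, min_samples=3).fit_predict(X3),
        'Hierarchical_labels': AgglomerativeClustering(n_clusters=5).fit_predict(X3),
    }, index=df_numeric.index)
    return X3, labels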

Observations from clustering

  1. The plots above indicate that KMeans has the best clusters, although it has to be said that the clusters are not very clear and spaced out even with KMeans, which indicates that there may not be very clear patterns in the data.
  2. A value of k=5 was chosen as it matches the bins that we want to create for classifying the number of Starbucks stores as Very Low, Low, Medium, High and Very High.
  3. Vectors at the center of the clusters (KMeans) are centroids and the points in a cluster are all placed around them.
  4. The hierarchical and DBSCAN clustering produce similar results, as can be observed just by visual inspection of the plots; the KMeans clusters seem to be much better formed.

Specific observation from KMeans

The following table shows a filtered list of labels from each of the clustering algorithms attached to each entry in the dataset. The KMeans labels do seem to show a pattern in the countries that are grouped together.

  1. KMeans label=4 groups together the U.S. and China in a cluster of their own, which should not be a surprise. Of course, we can see any number of similarities that could have caused this, for example GDP, tourist arrivals etc. Also, from a Starbucks perspective, these countries represent a very high number of Starbucks stores (the top 2).

  2. KMeans label=3 groups together Brazil, Russia and other countries of the erstwhile Russian federation. From a Starbucks perspective these countries represent very similar Starbucks store density (number of stores per 100,000 people).

  3. KMeans label=2 groups together mostly Asian, African and some South American countries which have low to medium ease of business but medium to high population.

Other labels can also be explained in a similar way.


In [28]:
import pandas as pd
Combined_W_Labels_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/clustering/WDI_SB_w_labels.csv'
df = pd.read_csv(Combined_W_Labels_URL)
df[['name', 'KMeans_labels', 'SL.GDP.PCAP.EM.KD','Ease.Of.Doing.Business','ST.INT.ARVL.Categorical','SP.POP.TOTL.Categorical','Num.Starbucks.Stores','Starbucks.Store.Density']].head(10)


Out[28]:
name KMeans_labels SL.GDP.PCAP.EM.KD Ease.Of.Doing.Business ST.INT.ARVL.Categorical SP.POP.TOTL.Categorical Num.Starbucks.Stores Starbucks.Store.Density
0 Belgium 0 98644.171875 Medium M M 19 0.168354
1 Trinidad and Tobago 2 61845.410156 Medium VL L 1 0.073525
2 Oman 3 85342.617188 Medium L L 12 0.267228
3 Aruba 3 65623.968183 NaN VL VL 3 2.887697
4 Luxembourg 0 201747.796875 Medium VL VL 2 0.351077
5 Lebanon 1 42497.589844 Low L M 29 0.495664
6 El Salvador 1 18404.509766 Medium VL M 10 0.163223
7 Romania 3 37818.410156 Medium M H 25 0.126056
8 Costa Rica 3 30870.919922 Medium M M 11 0.228792
9 Argentina 1 31734.800781 Low M H 105 0.241842

Association Rules / Frequent Itemset Mining Analysis

Association rule mining is run on the combined dataset to find tuples of feature values that occur together in the dataset and to draw inferences from that. Most of the data in the combined dataset is continuous or discrete; to run association rule mining, the first step is to convert the data into categorical values. The combined dataset contains 40-odd features, so instead of converting the entire dataset to categorical, some selected features are converted and association rule mining is run on those features.

Continuous/discrete features are converted into categorical ones by binning them by percentile values, as described in previous sections. The dataset for this particular problem does not lend itself very well to this type of analysis (because a large number of features are numeric). The conversion of numeric data to categorical in the context of this dataset is an art rather than a science (meaning what bin sizes are good for categorizing the number of stores as low, medium or high is subjective), therefore we can get vastly different results by changing the criteria for converting numeric data to categorical (have 3 bins instead of 5, for example). The regression and classification algorithms discussed later are better suited for this data science problem.

To select which features could have a relationship with the dependent variables that we want to predict, we draw scatter plots of the dependent feature vs. all numeric features in the dataset and manually examine where a correlation is seen. Another technique used for finding the most important features is a machine learning technique (ExtraForestClassifier), from which the top two most important features are selected for conversion to categorical. The scatter plots are not included in this report although they are available as part of the overall package that has been submitted. Similarly, the list of important features can be seen in the logs generated on running the program.
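A minimal sketch of the feature-importance step is shown below, using scikit-learn's ExtraTreesClassifier as a stand-in for the classifier named above; the target column and the fill-with-zero handling of missing values are assumptions.

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

def top_features(df, target_col='Num.Starbucks.Stores.Categorical', n=2):
    """Sketch: rank numeric features by their importance for predicting the categorical store count."""
    # In practice the store-count columns themselves would be excluded from the predictors.
    X = df.select_dtypes(include='number').fillna(0)
    y = df[target_col]
    model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(n)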

Two different types of rule mining are done: one with rules containing a single variable predicting the dependent variable, and another with two variables predicting the dependent variable.

Association rules with 2 features

The following table identifies all the association rules with two variables:

Feature Description Categorical values
Num.Starbucks.Stores.Categorical Number of Starbucks stores, categorical VL, L, M, H, VH
ST.INT.ARVL.Categorical International tourist arrivals, categorical VL, L, M, H, VH

The expectation is that a very high influx of tourists should occur together with a very high number of Starbucks stores. This does occur, as this is the second most frequent rule shown in the table below; however, the support value for this rule is a low 0.125 and the confidence is 0.6. The rule that occurs most frequently is that a medium tourist influx corresponds to a medium number of Starbucks stores; this rule occurs 15 times and has a support of 0.20 and a confidence of 0.53. We expected a greater support level for the more frequent rules, but this is not happening, probably for two reasons: a) there are only 72 countries with Starbucks stores, so there is not a lot of data, and b) the conversion of numeric to categorical data in this case is not an exact science and can be tweaked to give better results (reduce the number of categories, for example).

This is all summarized in the following table. The rules themselves are written in the following format: R:({A})->B means that if A is present then B is also present (with the frequency, support and confidence listed alongside to give a sense of probability).
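As a quick illustration of how frequency, support and confidence relate, here is a small sketch; the antecedent count of 28 is not listed in the table but is implied by the quoted support and confidence of the most frequent rule.

def rule_metrics(freq_rule, freq_antecedent, n_records):
    """For a rule A -> B: support = P(A and B), confidence = P(B | A)."""
    support = freq_rule / n_records
    confidence = freq_rule / freq_antecedent
    return support, confidence

# The most frequent rule occurs 15 times over 72 countries, and its antecedent
# (medium tourist influx) occurs 28 times: support ~0.208, confidence ~0.536.
print(rule_metrics(15, 28, 72))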


In [29]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_2_features.csv'
df = pd.read_csv(ASSOC_RULES_URL)
df.sort_values(by='frequency', ascending=False).head()


Out[29]:
rule frequency support confidence
0 R:({SBS_M})->TOURIST_ARR_M 15 0.208333 0.535714
18 R:({TOURIST_ARR_M})->SBS_M 15 0.208333 0.535714
29 R:({TOURIST_ARR_VH})->SBS_VH 9 0.125000 0.600000
12 R:({SBS_VH})->TOURIST_ARR_VH 9 0.125000 0.600000
19 R:({TOURIST_ARR_M})->SBS_H 7 0.097222 0.250000
### Association rules with 2 features filtered by 3 different support levels The following tables present association rules filterd by 3 different support levels. 1. support >= 0.05 and confidence >= 0.25 2. support >= 0.07 and confidence >= 0.25 3. support >= 0.1 and confidence >= 0.25

In [30]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_2_features_min_support_0.05_and_min_confidence_0.25.csv'
df1 = pd.read_csv(ASSOC_RULES_URL)
df1


Out[30]:
rule frequency support confidence
0 R:({SBS_M})->TOURIST_ARR_M 15 0.208333 0.535714
1 R:({SBS_VL})->TOURIST_ARR_VL 5 0.069444 0.555556
2 R:({SBS_H})->TOURIST_ARR_M 7 0.097222 0.500000
3 R:({SBS_H})->TOURIST_ARR_VH 4 0.055556 0.285714
4 R:({SBS_VH})->TOURIST_ARR_VH 9 0.125000 0.600000
5 R:({SBS_VH})->TOURIST_ARR_H 4 0.055556 0.266667
6 R:({TOURIST_ARR_M})->SBS_M 15 0.208333 0.535714
7 R:({TOURIST_ARR_M})->SBS_H 7 0.097222 0.250000
8 R:({TOURIST_ARR_VL})->SBS_VL 5 0.069444 0.625000
9 R:({TOURIST_ARR_L})->SBS_M 4 0.055556 0.571429
10 R:({TOURIST_ARR_VH})->SBS_VH 9 0.125000 0.600000
11 R:({TOURIST_ARR_VH})->SBS_H 4 0.055556 0.266667
12 R:({TOURIST_ARR_H})->SBS_M 6 0.083333 0.428571
13 R:({TOURIST_ARR_H})->SBS_VH 4 0.055556 0.285714

In [31]:
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_2_features_min_support_0.07_and_min_confidence_0.25.csv'
df2 = pd.read_csv(ASSOC_RULES_URL)
df2


Out[31]:
rule frequency support confidence
0 R:({SBS_M})->TOURIST_ARR_M 15 0.208333 0.535714
1 R:({SBS_H})->TOURIST_ARR_M 7 0.097222 0.500000
2 R:({SBS_VH})->TOURIST_ARR_VH 9 0.125000 0.600000
3 R:({TOURIST_ARR_M})->SBS_M 15 0.208333 0.535714
4 R:({TOURIST_ARR_M})->SBS_H 7 0.097222 0.250000
5 R:({TOURIST_ARR_VH})->SBS_VH 9 0.125000 0.600000
6 R:({TOURIST_ARR_H})->SBS_M 6 0.083333 0.428571

In [32]:
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_2_features_min_support_0.1_and_min_confidence_0.25.csv'
df = pd.read_csv(ASSOC_RULES_URL)
df


Out[32]:
rule frequency support confidence
0 R:({SBS_M})->TOURIST_ARR_M 15 0.208333 0.535714
1 R:({SBS_VH})->TOURIST_ARR_VH 9 0.125000 0.600000
2 R:({TOURIST_ARR_M})->SBS_M 15 0.208333 0.535714
3 R:({TOURIST_ARR_VH})->SBS_VH 9 0.125000 0.600000

Association rules with 3 features

The following table describes the three features used in these association rules:

Feature Description Categorical values
Num.Starbucks.Stores.Categorical Number of Starbucks stores, categorical VL, L, M, H, VH
ST.INT.ARVL.Categorical International tourist arrivals, categorical VL, L, M, H, VH
SP.POP.TOTL.Categorical Total population, categorical VL, L, M, H, VH

The expectation is that a high influx of tourists and a high local population should occur together with a high number of Starbucks stores. This does occur: it is the second most frequent rule in the table below, although the support for this rule is a very low 0.08 while the confidence is 0.75. The rule that occurs most frequently is that a medium tourist influx in a country with a medium population corresponds to a medium number of Starbucks stores; this rule occurs 7 times and has a support of 0.09 and a confidence of 0.53. A greater support level was expected for the more frequent rules, but this is probably not happening for two reasons: a) there are only 72 countries with Starbucks stores, so there is not a lot of data, and b) the conversion of numeric to categorical data in this case is not an exact science and can be tweaked to give better results (for example, by reducing the number of categories).

This is all summarized in the following table. The rules are written in the format R:({A,B})->C, meaning that if A and B are present together then C is also present, with the frequency, support and confidence listed alongside to give a sense of probability.


In [33]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_3_features.csv'
df = pd.read_csv(ASSOC_RULES_URL)
df.sort_values(by='frequency', ascending=False).head()


Out[33]:
rule frequency support confidence
0 R:({POPT_M,TOURIST_ARR_M})->SBS_M 7 0.097222 0.538462
35 R:({POPT_VH,TOURIST_ARR_VH})->SBS_VH 6 0.083333 0.750000
9 R:({POPT_M,TOURIST_ARR_H})->SBS_M 4 0.055556 0.500000
20 R:({POPT_VL,TOURIST_ARR_VL})->SBS_VL 4 0.055556 0.800000
1 R:({POPT_M,TOURIST_ARR_M})->SBS_H 3 0.041667 0.230769

Association rules with 3 features filtered by 3 different support levels

The following tables present association rules filterd by 3 different support levels.

  1. support >= 0.05 and confidence >= 0.25
  2. support >= 0.07 and confidence >= 0.25
  3. support >= 0.1 and confidence >= 0.25

In [34]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_3_features_min_support_0.05_and_min_confidence_0.25.csv'
df1 = pd.read_csv(ASSOC_RULES_URL)
df1


Out[34]:
rule frequency support confidence
0 R:({POPT_M,TOURIST_ARR_M})->SBS_M 7 0.097222 0.538462
1 R:({POPT_M,TOURIST_ARR_H})->SBS_M 4 0.055556 0.500000
2 R:({POPT_VL,TOURIST_ARR_VL})->SBS_VL 4 0.055556 0.800000
3 R:({POPT_VH,TOURIST_ARR_VH})->SBS_VH 6 0.083333 0.750000

In [35]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_3_features_min_support_0.07_and_min_confidence_0.25.csv'
df1 = pd.read_csv(ASSOC_RULES_URL)
df1


Out[35]:
rule frequency support confidence
0 R:({POPT_M,TOURIST_ARR_M})->SBS_M 7 0.097222 0.538462
1 R:({POPT_VH,TOURIST_ARR_VH})->SBS_VH 6 0.083333 0.750000

In [36]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_3_features_min_support_0.1_and_min_confidence_0.25.csv'
df1 = pd.read_csv(ASSOC_RULES_URL)
df1


Out[36]:
rule frequency support confidence

Predictive Analysis

This section describes three hypotheses that have been developed using the combined WorldBank and Starbucks dataset and then tests each of them using parametric and/or predictive algorithms.

Hypothesis 1: There is no difference in the average number of Starbucks stores in countries with a very high number of international tourists vs. the rest of the world.

This hypothesis is tested in two ways.

  1. Two sample T-test: A two sample T-test is used to find out if the means of two independent samples are different. The null hypothesis is that the means of both samples are the same. The two sample T-test relies only on sample means and does not need to know the population mean.

  2. Linear and Polynomial regression: ordinary least squares regression. Linear regression tries to predict the value of the dependent variable using a linear combination of explanatory variables. Linear regression can be univariate (one explanatory variable) or multivariate. Polynomial regression tries to model the dependent variable using higher powers of the explanatory variable(s).

Using the two sample T-test for hypothesis 1

The combined dataset is filtered to obtain entries where 'ST.INT.ARVL.Categorical' equals 'VH' (very high) and another set of entries where 'ST.INT.ARVL.Categorical' has any other value, representing the rest of the world. The Python scipy module is used to calculate a T statistic and a corresponding p-value when comparing the two distributions. The p-value comes out to be ~0.17. This is an unusually large value, and with a typical T-test alpha of 0.05 it would not be possible to reject the null hypothesis. However, since the distribution of the number of stores was assumed to be normal when it is known not to be (see the histogram in previous sections), a higher alpha of 0.20 is chosen, which allows the null hypothesis of no difference to be rejected. This is in line with what is seen in the scatter plots for these two features.
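The following is a minimal sketch of this test using scipy, with made-up store counts standing in for the 'VH' group and the rest of the world (the project's actual numbers come from the combined dataset).

import numpy as np
from scipy import stats

# Hypothetical store counts: countries with 'VH' tourist arrivals vs. the rest.
stores_vh = np.array([380, 950, 1000, 1400, 2800, 13000])
stores_rest = np.array([5, 9, 12, 18, 30, 48, 85, 140, 260, 330])

t_stat, p_value = stats.ttest_ind(stores_vh, stores_rest)
print('T-statistic = %.3f, p-value = %.3f' % (t_stat, p_value))

alpha = 0.20  # relaxed alpha chosen because the counts are not really normal
print('reject H0' if p_value < alpha else 'fail to reject H0')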


In [37]:
import pandas as pd
T_TEST_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/regression/t_test_results.csv'
df1 = pd.read_csv(T_TEST_URL)
df1


Out[37]:
Hypothesis No difference between average number of Starbucks stores in countries with Very high and high number of international tourist Vs Rest of the world
0 T-statistic, p-value 1.452938,0.168245

Using Linear and Polynomial regression for hypothesis 1

In this method, instead of using the categorical value of the number of international tourist arrivals, the actual discrete value is used and provided to a linear regression model and a polynomial regression model to predict the number of Starbucks stores. The linear model provides an explained variance of 0.27, whereas the polynomial model performs much better, providing an explained variance of 0.33. Ideally the explained variance should be close to 1; it is a matter of further investigation as to what else can be done to make the prediction better. The following scatter plots show the linear and polynomial regression models in action; as can be seen from the plots, the polynomial model provides a better fit. The polynomial model is a degree 3 polynomial (other degrees were also tried and 3 provided the best results). The outlier value seen towards the top right is the number of stores in the U.S.; it is an extreme value and further investigation would be needed to determine whether a single variable can predict such a huge jump in value.
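A minimal sketch of the linear vs. degree-3 polynomial comparison is shown below, using scikit-learn and made-up (tourist arrivals, store count) pairs; the real analysis uses the combined dataset.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import explained_variance_score

X = np.array([[2e6], [5e6], [1e7], [2e7], [4e7], [8e7]])  # tourist arrivals
y = np.array([10, 40, 120, 300, 900, 1400])               # Starbucks stores

lin = LinearRegression().fit(X, y)
print('linear explained variance:',
      explained_variance_score(y, lin.predict(X)))

X_poly = PolynomialFeatures(degree=3).fit_transform(X)
poly = LinearRegression().fit(X_poly, y)
print('degree-3 polynomial explained variance:',
      explained_variance_score(y, poly.predict(X_poly)))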

Hypothesis 2: The number of Starbucks stores in a country as a categorical value can be predicted using some WDI indicators

This hypothesis states that the number of Starbucks stores in a country expressed as a categorical value (Very High, High, Medium, Low, Very Low) can be predicted using WDI (World Development Indicators) from the WorldBank data. This is verified by running various classification algorithms and then selecting the one that provides the highest accuracy value as measured via cross validation.

The combined dataset contains 35 WDI features. Not all of these are used for classification; the 15 most important features are chosen. The Extra Trees classifier, which uses an ensemble of decision trees, is used to provide a ranking for all the features, and the top ranked 15 features are then used as inputs for the classification algorithms. Tests were also done with 5, 10, 30 and all 35 features; the 15 most important features provided the best results in terms of accuracy.
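A minimal sketch of this feature-ranking step is shown below; the helper name top_k_features and the stand-in columns are assumptions, with X holding the WDI features and y the store-count categories in the real analysis.

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

def top_k_features(X, y, k=15):
    """Rank features with an Extra Trees ensemble and return the top k names."""
    clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y)
    ranked = pd.Series(clf.feature_importances_, index=X.columns)
    return ranked.sort_values(ascending=False).head(k).index.tolist()

# Example with random stand-in data (3 fake WDI columns, 2 classes).
X_demo = pd.DataFrame(np.random.rand(72, 3), columns=['WDI_A', 'WDI_B', 'WDI_C'])
y_demo = pd.Series(np.random.choice(['M', 'H'], size=72))
print(top_k_features(X_demo, y_demo, k=2))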

The 15 most important features for determining the number of stores value as a categorical are:

Feature Description
TX.VAL.TECH.CD High-technology exports (current US dollars)
ST.INT.ARVL International tourism, number of arrivals
SP.POP.TOTL Total Population
BG.GSR.NFSV.GD.ZS Trade in services (percentage of GDP)
SL.GDP.PCAP.EM.KD GDP per person employed (constant 2011 PPP dollars)
IC.LGL.CRED.XQ Strength of legal rights index (0=weak to 12=strong)
BX.KLT.DINV.WD.GD.ZS Foreign direct investment, net inflows (percentage of GDP)
IC.EXP.COST.CD Cost to export (US$ per container)
NE.CON.PETC.ZS Household final consumption expenditure, etc. (percentage of GDP)
IC.REG.DURS Time required to start a business (days)
TX.VAL.OTHR.ZS.WT Computer, communications and other services (percentage of commercial service exports)
IC.REG.PROC Start-up procedures to register a business (number)
IC.BUS.EASE.XQ Ease of doing business
SH.STA.ACSN.UR Improved sanitation facilities, urban (percentage of urban population with access)
IT.NET.USER.P2 Internet users (per 100 people)

The above list is very interesting because all of these indicators appear to be factors that a business would consider important. Many of them are either economic or directly related to business activity, so it intuitively makes sense that the machine learning algorithm is choosing the right indicators.

Once the indicators were selected, all of the following classification algorithms were applied to the data and the results in terms of accuracy and standard deviation of a K-fold cross validation were noted. Random forest turned out to be the best classifier in terms of accuracy.

The accuracy turned out to be in the 50 to 60 percent range, but this is just the first pass at prediction. The indicators, how they are used, and the bins for the classification of the dependent variable are all subject to refinement, so this is not the final score. The hypothesis is considered valid although the accuracy value is lower than desirable. More investigation needs to be performed to bring the accuracy score up to around the 80 percent range.

Description of classification algorithms used

All of the following algorithms were used for classification. All of these algorithms are available via the Scikit-learn Python package. A train/test split was done using a training data fraction of 20%. A 10-fold cross validation was done to determine the accuracy of the various algorithms (a sketch of this setup follows the table below).

Algorithm Description Important notes Results: Accuracy (standard deviation)
kNN K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). Used scikit learn's defaults. 0.476667 (0.164688)
CART Classification and Regression Trees. Decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches. Leaf node represents a classification or decision. The top most node is called root and is also the best predictor. Used scikit learn's defaults. 0.383333 (0.188709)
Naive Bayes The Naive Bayesian classifier is based on Bayes’ theorem with independence assumptions between predictors. A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Used scikit learn's defaults. Performs well along with kNN and RandomForest for the combined dataset. 0.493333 (0.221008)
SVM A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the margin between the two classes. The vectors (cases) that define the hyperplane are the support vectors. Used scikit learn's defaults (rbf kernel is the default, tried with other kernels all give the same result). Does not perform well for the dataset. 0.363333 (0.210000)
RandomForest A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default). Used scikit learn defaults. Provides the best results. 0.510000 (0.201136)
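A minimal sketch of how such a comparison might be set up with scikit-learn defaults and 10-fold cross validation is shown below; X and y are random stand-ins for the 15 selected WDI features and the store-count categories.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

models = {
    'kNN': KNeighborsClassifier(),
    'CART': DecisionTreeClassifier(),
    'NaiveBayes': GaussianNB(),
    'SVM': SVC(),
    'RandomForest': RandomForestClassifier(),
}

rng = np.random.default_rng(0)
X = rng.random((72, 15))                              # stand-in features
y = rng.choice(['VL', 'L', 'M', 'H', 'VH'], size=72)  # stand-in labels

for name, model in models.items():
    scores = cross_val_score(model, X, y,
                             cv=KFold(n_splits=10, shuffle=True, random_state=7))
    print('%s: %.3f (%.3f)' % (name, scores.mean(), scores.std()))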

Hypothesis 3: The number of Starbucks stores (discrete) in a country can be predicted using some WDI indicators.

This hypothesis states that the number of Starbucks stores in a country, which is a discrete variable (as opposed to the categorical value considered in hypothesis 2), can be predicted using some WDI indicators. The method used here is similar to hypothesis 2: first the three most important features are determined using the Extra Trees classifier, then these three features are fed into a multivariate polynomial regression model to predict the dependent variable, i.e. the number of Starbucks stores in a country. The three features used for this prediction are 'ST.INT.ARVL' (international tourist arrivals), 'TX.VAL.TECH.CD' (high-technology exports, current US dollars) and 'SP.POP.TOTL' (population of the country).

A multivariate polynomial model is chosen since there is more than one explanatory variable and the relationship with the dependent variable is definitely not linear, as is known from the scatter plots (not included here but available as part of the package). A degree 3 polynomial works best, as determined by experimenting with other values as well. An explained variance score of 0.99 is achieved; however, the mean squared error is still very high at 27063. The MSE needs to be improved before the hypothesis can be considered valid, and this needs further investigation in terms of feature selection and the model used for prediction.
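A minimal sketch of such a multivariate degree-3 polynomial regression is shown below; the three columns stand for ST.INT.ARVL, TX.VAL.TECH.CD and SP.POP.TOTL, and the values are made up for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score, mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[8e7, 1.5e11, 3.2e8],
              [3e7, 8.0e10, 6.0e7],
              [1e7, 2.0e10, 5.0e7],
              [5e6, 5.0e09, 1.0e7],
              [2e6, 1.0e09, 5.0e6]])
y = np.array([13000, 1200, 300, 40, 5])  # number of Starbucks stores

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)
pred = model.predict(X)
print('explained variance:', explained_variance_score(y, pred))
print('MSE:', mean_squared_error(y, pred))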

Other Hypotheses

A couple of other hypotheses were also tested as part of this phase.

  1. Whether the ownership type of Starbucks stores in a country should be mixed (Starbucks owned, franchisee owned or other) or just Starbucks owned. All classification models were tried and it was found that Random Forest and kNN both predict this with about 84% accuracy (stddev 12%).

Bringing it all together [Project1 + Project2]

At this point in the data science lifecycle a clean combined dataset exists and has been used to validate multiple hypotheses as described above. The focus of this phase was getting everything in place to run predictive algorithms; the next phase will focus on improving the accuracy of the predictions. The results of the predictions are included in the overall package (see the regression folder) but are not being discussed here at this time.

What does appear is that, more than a regression problem, this can be modeled as a clustering and classification problem. In other words, clustering algorithms (like KMeans as described above) can be used to see which clusters the countries with no Starbucks stores fall into; for example, if they appear in clusters corresponding to countries with a high number of Starbucks stores then that is a prediction. Similarly, the RandomForest or NaiveBayes classifier (which provided between 50 to 61% accuracy) could be used to predict the category (low, medium, high etc.) for the number of Starbucks stores.

Another aspect to review is whether the categories for the number of Starbucks stores need to be reduced from 5 (Very high, high, medium, low, very low) to just 3 (high, medium, low) to improve the accuracy of predictions. These are all things to be considered for the next phase of the project.

Predictions based on Classification

As discussed in the previous section, it appears that classification is a better option than regression for prediction in this data science problem. In other words, it is possible to predict the number of Starbucks stores in a country as a categorical variable (i.e. Very Low, Low, Medium, High, Very High) with a much higher degree of accuracy than predicting the exact number as a continuous variable. The RandomForest classifier discussed above provided an accuracy of between 50 to 61% (stddev 12 to 20%); it is used to predict the number of stores in countries which do not have a Starbucks presence yet but for which data is available for all the features needed for prediction (a sketch of this prediction step follows the table below). The features used for prediction are listed under Hypothesis 2 and are not repeated here. Since the original WorldBank data has a lot of missing data, predictions cannot be made for all countries where Starbucks does not have a presence yet. The following table lists the results of the prediction (VL: Very Low, L: Low, M: Medium, H: High, VH: Very High). The categories translate into the following numerical ranges:

Range Category
[1,3] Very Low
[4,6] Low
[7,41] Medium
[42, 128] High
[129 and above] Very High
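The actual predictions are loaded from the project's output CSV in the next cell; the sketch below shows how such predictions could be produced in principle, assuming a combined dataframe with the listed feature columns, a 'Num.Starbucks.Stores' count and the country identifiers (all names here are assumptions).

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def predict_for_new_countries(df, feature_cols):
    """Train on countries that already have stores, predict for those without."""
    has_stores = df['Num.Starbucks.Stores'] > 0
    train = df[has_stores].dropna(subset=feature_cols)
    # Prediction is only possible where all feature values are present.
    to_predict = df[~has_stores].dropna(subset=feature_cols)

    clf = RandomForestClassifier(random_state=0)
    clf.fit(train[feature_cols], train['Num.Starbucks.Stores.Categorical'])

    out = to_predict[['country_codes', 'names']].copy()
    out['Num.Starbucks.Stores.Categorical'] = clf.predict(to_predict[feature_cols])
    return out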

In [1]:
import pandas as pd
RESULTS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/classification/Num.Starbucks.Stores.Categorical_predictions.csv'
df = pd.read_csv(RESULTS_URL)
df


Out[1]:
Num.Starbucks.Stores.Categorical country_codes names
0 H 1W World
1 H XC Euro area
2 H XF IBRD only
3 H Z7 Europe & Central Asia
4 H T7 Europe & Central Asia (IDA & IBRD countries)
5 H XJ Latin America & Caribbean (excluding high income)
6 H T2 Latin America & the Caribbean (IDA & IBRD coun...
7 H EC Ecuador
8 H XH IDA blend
9 H XT Upper middle income
10 L BF Burkina Faso
11 L UG Uganda
12 L PY Paraguay
13 L MN Mongolia
14 M LV Latvia
15 M DO Dominican Republic
16 M HK Hong Kong SAR, China
17 M DZ Algeria
18 M IT Italy
19 M EE Estonia
20 M AM Armenia
21 M LK Sri Lanka
22 M MU Mauritius
23 M HR Croatia
24 M MD Moldova
25 M GY Guyana
26 M GE Georgia
27 M BJ Benin
28 M AL Albania
29 M MK Macedonia, FYR
30 M ZQ Middle East & North Africa
31 M LT Lithuania
32 M ME Montenegro
33 M XU North America
34 M NP Nepal
35 M UA Ukraine
36 M B8 Central Europe and the Baltics
37 M RW Rwanda
38 M JM Jamaica
39 M SI Slovenia
40 VH XD High income
41 VH OE OECD members
42 VH EU European Union
43 VL HN Honduras
44 VL SL Sierra Leone
45 VL SR Suriname
46 VL IS Iceland
47 VL MZ Mozambique
48 VL TZ Tanzania
49 VL 7E Europe & Central Asia (excluding high income)
50 VL IL Israel
51 VL UY Uruguay
52 VL NI Nicaragua
53 VL BA Bosnia and Herzegovina

Discussion of the prediction results

The prediction throws up some interesting results. While some interpretation is provided here, it would be up to a subject matter expert (and, well, Starbucks) to say why these groupings make sense (or not). A point to note here (which has not been discussed earlier) is that the WorldBank data also includes data for groups of countries (such as OECD) combined as a single entity, so these groups also appear in the following classification.

Category Notes
Very High High income countries, OECD countries (Switzerland, Turkey, U.S., U.K.) and the EU show up in this category. This makes intuitive sense: high income, developed nations have a very high number of Starbucks stores. An interesting result here is the OECD countries group (U.S., U.K., Turkey, Switzerland), since a Very High or High number of Starbucks already exists in these countries, so this speaks to the accuracy of the model.
High Several groups show up in this category. The world overall is classified as a High Starbucks region.
Very Low This group consists of some African and South American countries. It is a matter of further study to say why these countries are clustered together. Israel and Iceland could be outliers in this list.
Medium This is the single largest group consisting of 29 countries from all over the world. Italy shows up in this list, whereas one would expect that there would already be Starbucks in Italy. This list can be combined with other aspects (see visualizations) such as tourist inflow to prioritize which countries should be higher on the list for Starbucks to enter. A few countries worth mentioning from this list are Sri Lanka, Mauritius, Jamaica and European countries like Italy, Albania, Lithuania, Montenegro, Slovenia etc.
Low This is a small list: a couple of African countries, one Asian and one South American country.

Time Series Analysis for number of Starbucks stores [Additional Analysis 1]

A piece of information available in the Starbucks dataset is the date on which a store was first seen; the dataset contains information from December 2013 to the current month, i.e. November 2016. The first seen field can be used to identify, say, a month on month increase in the number of stores, i.e. stores for which the first seen date falls in January 2014 can all be considered newly opened stores in January 2014 (this was also empirically verified by the fact that a new store in Clarksburg, Maryland that the author is personally aware of shows up with a first seen date of November 2016).

This information about the first seen date can be used to do a time series analysis to examine and predict the number of Starbucks stores in the next 12 months. To do this the following steps were followed:

  1. Extract a time series of interest i.e. total number of Starbucks stores across the world and plot it to visually examine how the number of stores was increasing over time.

  2. Examine the time series for stationarity, i.e. whether the mean of the series is changing with time. The idea behind requiring the time series to be stationary is that if it is stationary then individual points can be considered I.I.D (independent identically distributed) and the usual concepts for stationary random variables (Central Limit Theorem, Law of Large Numbers) can be applied. If the series is not stationary then it needs to be transformed into a stationary series. There are several transformations to stationarize a series, such as the logarithm, deflation by CPI, the first difference etc.

    • From visual examination of the number of Starbucks stores time series it is clear that the series is not stationary (number of stores is increasing with time). A Dickey-Fuller test could also have been done to check for stationarity but it has not been done as part of this project. The logarithmic transformation was used to stationarize the time series.
    • Seasonal decomposition of the Starbucks stores time series is also done to extract out trend and seasonal variations. Plots are presented below.
  3. The Auto Regressive Integrated Moving Average (ARIMA) model is used to forecast the timeseries. In general, ARIMA models are applied in some cases where data show evidence of non-stationarity (like in case of the Starbucks store data), where an initial differencing step (corresponding to the "integrated" part of the model) can be applied to reduce the non-stationarity. See https://www.otexts.org/fpp/8/1 for more details about time series and https://people.duke.edu/~rnau/411arim.htm for an explanation on ARIMA.

    • The parameters (p,d,q) of the non-seasonal ARIMA model are non-negative integers: p is the order (number of time lags) of the autoregressive model, d is the degree of differencing (the number of times the data have had past values subtracted), and q is the order of the moving-average model. The (p,d,q) values are generally identified by studying the ACF/PACF values; in this case they were identified by trial and error (the values that gave the least mean squared error were used).

    • A Seasonal ARIMA (SARIMA) could have been tried but current stable release of the statsmodels package (0.6.1) does not support it, although it is available in a version currently under development (0.7.1).

  4. The ARIMA model is given the logged time series for the reasons discussed above. The fitted values provided by the model are actually the differences of the logged series; these values are cumulatively summed to derive the fitted logged series. Once the predicted logged series is obtained it is transformed back to its original scale by taking the exponent (e^x) of the logged values. The forecasted values for future timestamps are obtained using the same procedure (see the sketch after this list).
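A minimal sketch of this log/difference/ARIMA/back-transform procedure is shown below; the monthly series is a stand-in, the (p, q) order is a placeholder (the project's values were found by trial and error), and the ARIMA import path shown requires statsmodels >= 0.12.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA  # statsmodels >= 0.12

# Stand-in monthly series of total stores (the project covers Dec 2013 - Nov 2016).
idx = pd.date_range('2013-12-01', periods=36, freq='MS')
stores = pd.Series(21000 + 110 * np.arange(36) + np.random.randn(36) * 30, index=idx)

log_stores = np.log(stores)
log_diff = log_stores.diff().dropna()         # first difference of the logged series

fit = ARIMA(log_diff, order=(2, 0, 1)).fit()  # (p, q) are placeholders here

# Back-transform: cumulative sum of fitted differences + first logged value, then exp.
fitted_stores = np.exp(fit.fittedvalues.cumsum() + log_stores.iloc[0])

# 12-month-ahead forecast, back-transformed the same way from the last observation.
forecast_stores = np.exp(fit.forecast(steps=12).cumsum() + log_stores.iloc[-1])
print(forecast_stores.round().astype(int))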

The following sections discuss time series analysis for two time series: one for the total number of stores across the world and another for the number of stores in the state of New York. Several other TSA analyses (such as time series for the number of stores in the U.K., India and the continent of Africa) are included in the project submission.

TSA for Starbucks Stores across the World

There are four plots provided below.

  1. Seasonal decomposition: clearly shows that there is an almost linear upwards trend as well as a seasonal element (a frequency value of 5 was used).
  2. First difference of the logged series along with the fitted values from the model.
  3. Original series with the fitted values (transformed back from the logged differences provided by the model). Also shown here is the root mean squared error (RMSE) to determine how good the prediction is. It is seen that the RMSE is ~152 which is roughly 0.60% of the total number of stores (25090), so the fitted value is off by less than 1% of the original value.
  4. Original series with fitted value and projected values. The projected values are for the next 12 month period i.e. from January 2017 to December 2017. The projected number of Starbucks stores in the World at the end of 2017 should be around 27995.

TSA for Starbucks Stores in the state of New York

The same four plots as above are provided for the time series analysis of the number of stores in the state of New York. The RMSE is again less than 1%. The forecasted number of stores in the state of New York at the end of 2017 is ~692, which would be an addition of 53 stores to the current number of 639.

Exploring distribution of Starbucks stores across the world [Additional Analysis 2]

This analysis explores the distribution of the number of Starbucks stores in countries which have a very high density of Starbucks stores (more than 1 Starbucks store per 100,000 people) and at least 100 Starbucks stores across the country. The number of stores per city in the country is modeled as a random variable and its distribution is examined. The following table provides the list of countries that have been studied; note that Singapore, which has 125 stores across the country, is not included in this list since there are only two cities in Singapore, one of which has a single Starbucks store.


In [3]:
import pandas as pd
RESULTS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/more/cities_withmost_stores.csv'
df = pd.read_csv(RESULTS_URL)
df


Out[3]:
country city_with_most_starbucks_stores count_in_city_with_most_stores
0 United States New York 231
1 United Kingdom London 195
2 UAE Dubai 75
3 Kuwait مدينة الكويت 74
4 Republic of Korea Seoul 234
5 Canada Toronto 191

From the above table it is clear that the maximum number of stores occurs in the capital city of the country, except in the case of the U.S. (New York, which is not a surprise) and Canada (Toronto); the Kuwaiti city shown in Arabic, مدينة الكويت, is Kuwait City.

The following boxplots and histograms show the distribution of Starbucks stores across cities in the above mentioned countries. As is clear from these plots, the distribution is right skewed with a very large difference between the third quartile and the maximum (because the one city with the maximum number of stores is an outlier).

In order to obtain any insight from the above plots the outliers need to be removed. This is done by "Winsorizing" (see https://en.wikipedia.org/wiki/Winsorizing) the distribution at the 0.05 level, i.e. all values greater than the 95th percentile are clamped at the 95th percentile value. The following are the plots after Winsorization. As seen from the plots, the basic nature of the distributions has not changed after Winsorization, but it is now clear that the median number of stores per city for all of these countries is around 3 and the 3rd quartile is less than 7. The shape of the distribution as seen from the histograms is similar for all of these countries, with the U.S. having the maximum number of stores in almost all percentiles. The shape of the distribution is discussed in the next section.
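A minimal sketch of the Winsorization step is shown below, using scipy's winsorize on made-up per-city counts that include one large outlier.

import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical per-city store counts for one country, with one outlier city.
stores_per_city = np.array([1] * 20 + [2] * 15 + [3] * 10 + [5] * 5 + [8, 12, 20, 231])
clamped = winsorize(stores_per_city, limits=[0, 0.05])  # clamp only the top 5%
print('max before:', stores_per_city.max(), 'max after:', np.asarray(clamped).max())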

Looking at the histogram above, this distribution looks similar to an exponential distribution, or at least something from that family of distributions. To understand this better, the distribution for each of these countries was examined individually and it was found by visual inspection that an Exponential-Logarithmic distribution or an Erlang distribution fitted them best. The Seaborn package used for making plots also provides the option to define a "fit" function and then overlays the fitted curve on the actual data.
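A minimal sketch of overlaying a fitted Erlang curve on a per-city store-count histogram is shown below; the data are randomly generated and the distplot interface used here is the older Seaborn API (newer releases replace it with histplot).

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Random stand-in for per-city store counts.
stores_per_city = np.random.gamma(shape=2.0, scale=2.0, size=200) + 1

sns.distplot(stores_per_city, kde=False, fit=stats.erlang)  # overlay fitted Erlang pdf
plt.xlabel('Starbucks stores per city')
plt.show()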

The following plots show the CDF (cumulative distribution function) plots as well as the histograms with the fitted curve from the Erlang distribution. The plots provided here are for the U.S. and the U.K. (the rest of the plots are included in the submission package). The Erlang distribution is used to model incoming calls in a telephone system, waiting times in queuing systems, interpurchase times in business economics, etc. In the context of this data science problem, this distribution could be applied to model the number of Starbucks stores across cities of a country where Starbucks does not exist today. The distribution could define the number of stores in the capital city (which would be an outlier) and the number of stores in other cities of the country, which would fall within the 1st quartile, 3rd quartile and so on. This could be used in conjunction with the previous analysis where the number of Starbucks stores is predicted as a categorical variable: from there a range for the total number of stores in the country is known, and this range could then be modeled using the Erlang distribution to distribute the total number of stores across multiple cities. This is a topic for further research.

Visualizations

This section provides several visualizations done as part of this project to better understand the data at hand. These visualizations are discussed individually in the subsequent sections.

Starbucks around the world [Visualization 1, Bokeh]

The following visualization shows all Starbucks stores as of November 2016, overlaid on a world map. The brown dots on the map represent the locations of Starbucks stores. Hovering over an individual point brings up popup text with the name of the location where the store exists along with the name of the city. Since the map is displayed at a certain zoom level, the points representing Starbucks stores appear very close to each other in high density areas.

The visualization clearly shows that a lot of Starbucks stores are located on the eastern and western coasts of the US, in the UK in Europe, in China and Japan in Asia, and in Malaysia and Indonesia in south east Asia. There are relatively few stores in South America and extremely few stores in Africa.

This visualization has been created using GMapPlot and Bokeh.
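A minimal sketch of such a plot is shown below; it assumes a dataframe with 'latitude', 'longitude', 'name' and 'city' columns and a valid Google Maps API key, and the exact Bokeh calls may vary with the Bokeh version.

import pandas as pd
from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, GMapOptions, HoverTool
from bokeh.plotting import gmap

# Stand-in for the Starbucks locations data.
df = pd.DataFrame({
    'latitude': [40.7588, 51.5145], 'longitude': [-73.9851, -0.1270],
    'name': ['Times Square', 'Covent Garden'], 'city': ['New York', 'London']})

map_options = GMapOptions(lat=20, lng=0, map_type='roadmap', zoom=2)
p = gmap('YOUR_GOOGLE_API_KEY', map_options, title='Starbucks stores')
p.circle(x='longitude', y='latitude', size=4, fill_color='brown',
         fill_alpha=0.8, source=ColumnDataSource(df))
p.add_tools(HoverTool(tooltips=[('name', '@name'), ('city', '@city')]))

output_file('locations.html')
show(p)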


In [15]:
from IPython.display import display, HTML
filename = 'locations.html'
# Pass the file via the filename keyword so its contents are rendered;
# a bare string would be treated as literal HTML markup.
display(HTML(filename=filename))


Bokeh Plot

Starbucks in the U.S. [Visualization 2, SVG]

With more than 13,000 Starbucks stores, the U.S. has the most Starbucks stores in the world by a huge margin, but how are those stores distributed within the US? This is what the next visualization explores by creating a heatmap of U.S. counties: the more stores, the more intense the color on the map. Mousing over a county brings up the name of the county and the number of stores in that county. The stores are, as expected, concentrated on the east and west coasts, with very few in the states at the geographical center of the country such as Nebraska, South Dakota and Kansas.

The maximum number of stores, 665, occurs in Los Angeles county in California. New York county (234 stores) in New York state, Maricopa county (334 stores) in Arizona, Cook county (325 stores) in Illinois and King county (336 stores) in Washington are some of the counties with a very high number of Starbucks stores. Just from visual examination of the graphic it appears that New York county (which includes Manhattan) would probably have the highest density of Starbucks stores in terms of the number of stores divided by the size of the county.

This plot has been created using a combination of an SVG (Scalable Vector Graphics) image of the US map with counties marked on it and the FCC (Federal Communications Commission) API that provides the name of the county given a location's latitude and longitude. All Starbucks locations are mapped to a county name using the FCC API in an offline process and a count is maintained of the number of stores in each county (see the sketch after the color table below). Once the store to county mapping is complete the result is stored in a CSV file. This CSV file is then used to assign color and hover text to the counties listed in the SVG file. The counties are colored using shades of a single color palette, with 0 stores (no Starbucks stores in the county) being the lightest and more than 300 stores being the darkest.

Number of stores Color Intensity (increases with number of stores)
0 Lightest
1
[2, 15)
[16,150)
[151, 300)
(301 and greater) Darkest
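The sketch below illustrates the offline store-to-county mapping and counting step; the county_for() helper is a hypothetical placeholder for the FCC API lookup (the real endpoint and response format should be taken from the FCC documentation), and the output filename is an assumption.

import pandas as pd

def county_for(lat, lon):
    """Hypothetical placeholder: return the county name for a lat/lon pair."""
    return 'Los Angeles'  # a real implementation would call the FCC API here

stores = pd.DataFrame({'latitude': [34.05, 34.10, 40.76],
                       'longitude': [-118.24, -118.33, -73.98]})
stores['county'] = [county_for(lat, lon)
                    for lat, lon in zip(stores['latitude'], stores['longitude'])]

counts = stores.groupby('county').size().rename('num_stores').reset_index()
counts.to_csv('stores_per_county.csv', index=False)  # feeds the SVG coloring step
print(counts)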

In [13]:
import IPython
fname = 'us_counties_starbucks_stores.html'
iframe = '<iframe src=' + fname + ' width=1000 height=550 scrolling=no frameBorder=0></iframe>'
IPython.display.HTML(iframe)


Out[13]:

Starbucks stores | Internet Users Vs International Tourists Arrivals [Visualization 3, Plotly]

As discussed in previous sections, the number of international tourist arrivals per year and internet users (per 100 people) are two important features which have been seen to influence the number of Starbucks stores in a country. These features were determined to be two of the more important features by the ExtraTreesClassifier and even intuitively it does make sense that countries with a large number of tourist arrivals and high Internet accessibility should have a higher number of Starbucks stores.

This relationship between the number of Starbucks stores, Internet users and international tourists is depicted using a bubble chart. The bubble chart represents 4-dimensional data as described below.

  1. The X axis represents the number of international tourist arrivals per year.
  2. The Y axis represents the number of Internet users per 100 people.
  3. Each country is represented by a bubble, with the diameter of the bubble representing the number of Starbucks stores in the country. The higher the number of stores, the larger the bubble.
  4. The color of the bubble represents the continent to which the country belongs.

The bubble chart presented below is created using Plotly; mouse over the bubbles to see information about the country, the number of Starbucks stores, etc. As with all Plotly charts, basic user interactions like selecting specific areas of the chart, saving as PNG etc. are available. All the data required for the bubble chart is available as part of the combined World Bank Starbucks dataset.


In [9]:
import IPython
url = 'https://plot.ly/~aa1603/2.embed?link=false'
iframe = '<iframe width=\"1000\" height=\"500\" frameborder=\"0\" scrolling=\"no\" src=\"%s\"></iframe>' %(url)
IPython.display.HTML(iframe)


Out[9]:

Some important observations from the above visualization are as follows:

  1. Almost all the bubbles which are medium to large in size occur to the right of the 10 million mark on the x-axis, which indicates that unless a country has north of 10 million international tourist arrivals a year it would have a very small number of Starbucks stores, probably around 20 or so, with the only exceptions being Brazil, Argentina, Philippines and Indonesia.

  2. The green colored bubbles which correspond to Asia are very visible (most of them are at least medium size, with China being the largest). This seems to indicate that countries in Asia generally tend to have a higher number of Starbucks stores than countries in other continents. This is also indicated in the boxplot drawn in a previous section, where the mean number of Starbucks stores was the highest for Asia.

  3. The largest bubble towards the top right side of the chart is of course the U.S., with high availability of Internet access and a very large number of tourists.

  4. France, which is the blue colored bubble in the top right corner, is relatively much smaller than the bubbles representing other countries with high availability of Internet access and a large influx of tourists. This looks like an aberration: France has only 124 Starbucks stores, yet with more than 80 million tourists per year and more than 80 out of 100 people having access to the Internet, France should have a lot more Starbucks stores than just 124.

Increase in Starbucks stores worldwide [Visualizations 4 & 5, Plotly]

The following bar chart shows the increase in Starbucks stores by country from December 2013 to November 2016. Mouse over individual bars to see the name of the country and the number of Starbucks stores added in that country between December 2013 and November 2016. As with all Plotly charts, basic user interactions like selecting specific areas of the chart, saving as PNG etc. are available.


In [12]:
import IPython
url = 'https://plot.ly/~aa1603/6.embed?link=false'
iframe = '<iframe width=\"1000\" height=\"500\" frameborder=\"0\" scrolling=\"no\" src=\"%s\"></iframe>' %(url)
IPython.display.HTML(iframe)


Out[12]:

Some observations from the above visualization:

  1. The maximum increase in terms of absolute number of stores happened in the U.S. followed by China, South Korea, Canada and Great Britain.
  2. Most of the growth is concentrated in the top 10 countries, in fact the increase in the number of stores in U.S. and China put together is way more than the increase in the number of stores in the rest of the world put together (almost 1.35 times more).

While the increase in absolute terms is useful, it is also interesting to find out which countries have the highest rate of increase, i.e. where Starbucks is expanding fastest. This list may not be the same as the list of countries with the greatest growth in absolute numbers. To determine this, the following procedure is followed (a sketch of this computation appears after the list):

  1. A month on month increase in number of Starbucks store is determined for each country.
  2. It is then converted to a percentage of the next month's value.
  3. This percentage month on month increase is then totaled up for the entire duration from December 2013 to November 2016.
  4. The countries with the highest cumulative percentage increase are determined.
    • Note that countries with less than 25 stores as of November 2016 are excluded from this analysis (anything less than 25 stores is considered too small to be significant).
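A minimal sketch of this cumulative percentage-increase computation for a single country is shown below; the monthly values are made up for illustration.

import pandas as pd

idx = pd.date_range('2013-12-01', periods=6, freq='MS')
stores = pd.Series([29, 31, 34, 36, 40, 44], index=idx)  # stand-in monthly counts

monthly_increase = stores.diff()                 # step 1: month on month increase
pct_of_month = monthly_increase / stores * 100   # step 2: as a % of that month's value
cumulative_pct = pct_of_month.sum()              # step 3: total over the period
print(round(cumulative_pct, 1))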

The time series of the number of stores for the 15 countries with the highest cumulative percentage increase are represented below as Sparkline charts (more on Sparklines: https://en.wikipedia.org/wiki/Sparkline). Note that this list does not include the U.S., so even though the U.S. had the highest increase in absolute numbers, the fastest rate of increase is elsewhere. It is seen from this visualization (and confirmed from the actual numbers, not shown here) that the maximum cumulative percentage increase happens in India, where the number of stores increased from 29 in December 2013 to 85 in November 2016.


In [11]:
import IPython
url = 'https://plot.ly/~aa1603/74.embed?link=false'
iframe = '<iframe width=\"800\" height=\"1000\" frameborder=\"0\" scrolling=\"no\" src=\"%s\"></iframe>' %(url)
IPython.display.HTML(iframe)


Out[11]:

Limitations and future work

The following limitations and scope for future study are identified:

  1. Many WDI features had to be discarded because there just wasn't sufficient data available; it is possible that some of these missing features might have been important from a prediction perspective. As and when data for these features becomes available, those features can also be included in the analysis.
    • Predictions could not be provided for all countries where Starbucks does not have a presence because data for the features used for prediction was not available for these countries; once this data becomes available these predictions can also be provided.
  2. Can the accuracy of predictions be increased by reducing the number of categories to 3 (Low, Medium, High) instead of the current 5? This will create a tradeoff between having the categories defined over much narrower intervals (and hence being more effective) and the accuracy with which these categories can be predicted.
  3. The Python statsmodels module will provide SARIMA (Seasonal ARIMA) at some point (hopefully soon); the TSA analysis would then be done with that model as well.
    • The parameters of the SARIMA model would be determined by observing the ACF/PACF values, something which has not been done in the current analysis.
  4. The distribution of Starbucks stores across geographies should also be examined by taking a log of the count of Starbucks stores in a city and then creating a histogram, which would transform the distribution into something more manageable. See the plot below. As with all Plotly charts, basic user interactions like selecting specific areas of the chart, saving as png etc. are available.

In [10]:
import IPython
url = 'https://plot.ly/~aa1603/76.embed?link=false'
iframe = '<iframe width=\"800\" height=\"500\" frameborder=\"0\" scrolling=\"no\" src=\"%s\"></iframe>' %(url)
IPython.display.HTML(iframe)


Out[10]:
